Using the grapheme-to-phoneme feature in CMU Sphinx-4

August 13, 2012

Foreword

This article summarizes and updates the previous articles [1] related to the new grapheme-to-phoneme (g2p) feature in the CMU Sphinx-4 speech recognizer [2].

To support automatic g2p transcription in Sphinx-4, a new weighted finite-state transducer (WFST) framework was created in Java [3]; its current API will be presented in a future article. Various new applications were also created, whose installation procedure and usage are presented in the following sections.
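
Conceptually, a g2p model is a weighted finite-state transducer that maps grapheme sequences to phoneme sequences, with weights scoring competing pronunciations. The following toy Python sketch illustrates the idea only; it is not the API of the Java fst framework (which a future article will cover), and the states, arcs and weights are made up:

```python
# Toy weighted finite-state transducer (WFST): maps a grapheme string
# to a phoneme sequence while summing arc weights (lower cost = better).
# Illustrative sketch only -- not the CMU Sphinx Java fst API.

# Arcs: (state, input grapheme) -> (next state, output phoneme, weight)
ARCS = {
    (0, "p"): (1, "P", 0.1),
    (1, "h"): (2, "F", 0.4),   # "ph" is pronounced F
    (1, "o"): (2, "OW", 0.2),
    (2, "o"): (3, "OW", 0.2),
}
FINAL_STATES = {2, 3}

def transduce(graphemes):
    """Follow the (deterministic) transducer; return (phonemes, cost)."""
    state, phonemes, cost = 0, [], 0.0
    for g in graphemes:
        state, phoneme, weight = ARCS[(state, g)]
        phonemes.append(phoneme)
        cost += weight
    if state not in FINAL_STATES:
        raise ValueError("input not accepted by the transducer")
    return phonemes, cost
```

For example, transduce("pho") yields the phoneme sequence ['P', 'F', 'OW'] with a total cost of about 0.7. A real g2p WFST is far larger and non-deterministic, and decoding searches for the lowest-cost paths.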

The procedures presented here were verified using openSUSE 12.1 x64 in a VirtualBox machine, but should apply to all recent Linux distributions (either 32- or 64-bit). They assume that you are logged in as user test and that all required software is saved under the /home/test/cmusphinx directory. As a final note, the output of the various commands is omitted in this article, but it should be watched for errors or useful information, especially when troubleshooting.

1. Installation

1.1. Required 3rd party libraries and applications

The following third-party libraries should be installed on your system before installing and running the main applications. Note that they are only required in order to train new g2p models; they are not required if you just want to use an existing g2p model in Sphinx-4.

1.1.1. OpenFst

OpenFst [4] is a library written in C++ for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). You can download the latest version available at [4]. This article uses version 1.3.2.

test@linux:~/cmusphinx> wget http://www.openfst.org/twiki/pub/FST/FstDownload/openfst-1.3.2.tar.gz
...
test@linux:~/cmusphinx> tar -xzf openfst-1.3.2.tar.gz
test@linux:~/cmusphinx> cd openfst-1.3.2/
test@linux:~/cmusphinx/openfst-1.3.2> ./configure --enable-compact-fsts --enable-const-fsts --enable-far --enable-lookahead-fsts --enable-pdt
...
test@linux:~/cmusphinx/openfst-1.3.2> make
...
test@linux:~/cmusphinx/openfst-1.3.2> sudo make install
...
test@linux:~/cmusphinx/openfst-1.3.2> cd ..

1.1.2. OpenGrm NGram

The OpenGrm NGram library [5] is used for making and modifying n-gram language models encoded as weighted finite-state transducers (FSTs). It makes use of functionality in the OpenFst library to create, access and manipulate n-gram models. You can download the latest version available at [5]. This article uses version 1.0.3.

test@linux:~/cmusphinx> wget http://www.openfst.org/twiki/pub/GRM/NGramDownload/opengrm-ngram-1.0.3.tar.gz
...
test@linux:~/cmusphinx> tar -xzf opengrm-ngram-1.0.3.tar.gz
test@linux:~/cmusphinx> cd opengrm-ngram-1.0.3/
test@linux:~/cmusphinx/opengrm-ngram-1.0.3> ./configure
...
test@linux:~/cmusphinx/opengrm-ngram-1.0.3> make
...

If the make command fails to complete on a 64-bit operating system, re-execute the configure command with the library path set explicitly and then run make again:

test@linux:~/cmusphinx/opengrm-ngram-1.0.3> ./configure LDFLAGS=-L/usr/local/lib64/fst
...
test@linux:~/cmusphinx/opengrm-ngram-1.0.3> make
...
test@linux:~/cmusphinx/opengrm-ngram-1.0.3> sudo make install
...
test@linux:~/cmusphinx/opengrm-ngram-1.0.3> cd ..
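
To give an intuition for what the OpenGrm step contributes during g2p training, the core of an n-gram model is counting n-grams over token sequences (here, aligned grapheme/phoneme chunks). Below is a short Python sketch of plain bigram counting; it illustrates the idea only and is not OpenGrm's API:

```python
from collections import Counter

def ngram_counts(tokens, order):
    """Count n-grams of the given order, padding with the sentence
    markers (<s>, </s>) that n-gram toolkits conventionally use."""
    padded = ["<s>"] * (order - 1) + list(tokens) + ["</s>"]
    return Counter(tuple(padded[i:i + order])
                   for i in range(len(padded) - order + 1))

# Bigrams over one (made-up) phoneme sequence.
counts = ngram_counts(["P", "F", "OW"], 2)
```

OpenGrm then smooths such counts (Kneser-Ney by default in the training run shown later) and encodes the resulting model as an FST.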

1.2. Main applications

1.2.1. SphinxTrain

With the OpenFst and OpenGrm libraries installed, a new g2p model can be trained in SphinxTrain while training a new acoustic model [6].

test@linux:~/cmusphinx> svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/sphinxbase
...
test@linux:~/cmusphinx> svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx
...
test@linux:~/cmusphinx> svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/SphinxTrain
...
test@linux:~/cmusphinx> cd sphinxbase/
test@linux:~/cmusphinx/sphinxbase> ./autogen.sh
...
test@linux:~/cmusphinx/sphinxbase> make
...
test@linux:~/cmusphinx/sphinxbase> sudo make install
...
test@linux:~/cmusphinx/sphinxbase> cd ../pocketsphinx/
test@linux:~/cmusphinx/pocketsphinx> ./autogen.sh
...
test@linux:~/cmusphinx/pocketsphinx> make
...
test@linux:~/cmusphinx/pocketsphinx> sudo make install
...
test@linux:~/cmusphinx/pocketsphinx> cd ../SphinxTrain/
test@linux:~/cmusphinx/SphinxTrain> ./autogen.sh --enable-g2p-decoder
...
test@linux:~/cmusphinx/SphinxTrain> make
...
test@linux:~/cmusphinx/SphinxTrain> sudo make install
...
test@linux:~/cmusphinx/SphinxTrain>

1.2.2. Sphinx-4

The g2p decoding functionality was introduced in SVN revision 11556. In addition to sphinx-4, you also need to check out the latest revision of the Java fst framework:

test@linux:~/cmusphinx> svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/branches/g2p/fst
...
test@linux:~/cmusphinx> svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/sphinx4
...
test@linux:~/cmusphinx> cd fst
test@linux:~/cmusphinx/fst> ant jar
...
test@linux:~/cmusphinx/fst> cp dist/fst.jar ../sphinx4/lib/
test@linux:~/cmusphinx/fst> cd ../sphinx4/lib/
test@linux:~/cmusphinx/sphinx4/lib> ./jsapi.sh
...
test@linux:~/cmusphinx/sphinx4/lib> cd ..
test@linux:~/cmusphinx/sphinx4> ant
...
test@linux:~/cmusphinx/sphinx4>

2. Training a g2p model

2.1. Training through SphinxTrain

Following the instructions for training an acoustic model found at [6] can also train a g2p model.

In addition to [6], for the current revision of SphinxTrain (11554), after running the sphinxtrain -t an4 setup command you need to enable the g2p functionality by setting the $CFG_G2P_MODEL variable in the generated configuration file (etc/sphinx_train.cfg) to

$CFG_G2P_MODEL= 'yes';

Running sphinxtrain run according to [6] will then produce output related to the g2p model training similar to the following:
...
MODULE: 0000 train grapheme-to-phoneme model
Phase 1: Cleaning up directories: logs...
Phase 2: Training g2p model...
Phase 3: Evaluating g2p model...
INFO: cmd_ln.c(691): Parsing command line:
/usr/local/lib/sphinxtrain/phonetisaurus-g2p \
-model /home/test/cmusphinx/an4/g2p/an4.fst \
-input /home/test/cmusphinx/an4/g2p/an4.words \
-beam 1500 \
-words yes \
-isfile yes \
-output_cost yes \
-output /home/test/cmusphinx/an4/g2p/an4.hyp
Current configuration:
[NAME]  [DEFLT] [VALUE]
-beam  500 1500
-help  no  no
-input     /home/test/cmusphinx/an4/g2p/an4.words
-isfile    no yes
-model     /home/test/cmusphinx/an4/g2p/an4.fst
-nbest  1  1
-output     /home/test/cmusphinx/an4/g2p/an4.hyp
-output_cost  no yes
-sep
-words  no yes
Words: 13  Hyps: 13 Refs: 13
(T)otal tokens in reference: 50
(M)atches: 46  (S)ubstitutions: 1  (I)nsertions: 4  (D)eletions: 3
% Correct (M/T)           -- %92.00
% Token ER ((S+I+D)/T)    -- %16.00
% Accuracy 1.0-ER         -- %84.00
(S)equences: 13  (C)orrect sequences: 8  (E)rror sequences: 5
% Sequence ER (E/S)       -- %38.46
% Sequence Acc (1.0-E/S)  -- %61.54
Phase 4: Creating pronunciations for OOV words...
Phase 5: Merging primary and OOV dictionaries...
...
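
The error-rate figures in the evaluation log above follow directly from the printed token and sequence counts; the arithmetic, spelled out in Python:

```python
# Counts as printed by the evaluation phase above.
T = 50                      # (T)otal tokens in reference
M, S, I, D = 46, 1, 4, 3    # matches, substitutions, insertions, deletions
E, SEQ = 5, 13              # error sequences, total sequences

percent_correct = 100.0 * M / T              # 92.00
token_error_rate = 100.0 * (S + I + D) / T   # 16.00
accuracy = 100.0 - token_error_rate          # 84.00
sequence_error_rate = 100.0 * E / SEQ        # 38.46...
```

Note that matches, substitutions and deletions together account for all reference tokens (46 + 1 + 3 = 50), while insertions are extra hypothesis tokens, which is why the token error rate can exceed 100 minus %Correct.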

The training process generates an additional dictionary for the training-transcription words not found in the main dictionary, and creates pronunciations for them using the trained g2p model. After the training is completed, the model can be found under the g2p directory. The OpenFst binary model is the an4.fst file. If you plan to convert it to a Java binary model and use it in Sphinx-4, you also need the OpenFst text format, which consists of the main model file (an4.fst.txt) and two additional symbol files (an4.input.syms and an4.output.syms).
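
The OpenFst text format mentioned above is the plain AT&T arc list: each non-empty line is either an arc, `src dst isym osym [weight]`, or a final state, `state [weight]`. A minimal Python sketch of reading it (the example arcs are made up, not actual an4 model content):

```python
def parse_fst_text(lines):
    """Parse OpenFst (AT&T) text format into (arcs, final_states)."""
    arcs, finals = [], {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if len(fields) >= 4:        # arc: src dst isym osym [weight]
            src, dst, isym, osym = fields[:4]
            weight = float(fields[4]) if len(fields) > 4 else 0.0
            arcs.append((int(src), int(dst), isym, osym, weight))
        else:                       # final state: state [weight]
            finals[int(fields[0])] = (
                float(fields[1]) if len(fields) > 1 else 0.0)
    return arcs, finals

# Made-up fragment resembling a g2p model in text format.
arcs, finals = parse_fst_text([
    "0 1 a AH 0.5",
    "1 2 b B 0.25",
    "2",
])
```

The symbol files (an4.input.syms, an4.output.syms) simply map such textual labels to the integer IDs used in the binary model, one symbol/ID pair per line.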

2.2. Training using the standalone application

Although the default training parameters provide a relatively low Word Error Rate (WER), it may be possible to fine-tune the various parameters further and produce a model with an even lower WER. Assuming that the directory /home/test/cmusphinx/dict contains the cmudict.0.6d dictionary, a model can be built directly from the command line as follows:

test@linux:~/cmusphinx/dict> export PATH=/usr/local/lib/sphinxtrain/:$PATH
test@linux:~/cmusphinx/dict> g2p_train -ifile cmudict.0.6d
INFO: cmd_ln.c(691): Parsing command line:
g2p_train \
-ifile cmudict.0.6d
Current configuration:
[NAME]          [DEFLT]         [VALUE]
-eps            <eps>           <eps>
-gen_testset    yes             yes
-help           no              no
-ifile                          cmudict.0.6d
-iter           10              10
-noalign        no              no
-order          6               6
-pattern
-prefix         model           model
-prune          no              no
-s1s2_delim
-s1s2_sep       }               }
-seq1in_sep
-seq1_del       no              no
-seq1_max       2               2
-seq2in_sep
-seq2_del       no              no
-seq2_max       2               2
-seq_sep        |               |
-skip           _               _
-smooth         kneser_ney      kneser_ney
-theta          0               0.000000e+00
Splitting dictionary: cmudict.0.6d into training and test set
Splitting...
Using dictionary: model.train
Loading...
Starting EM...
Iteration 1: 1.23585
Iteration 2: 0.176181
Iteration 3: 0.0564651
Iteration 4: 0.0156775
Iteration 5: 0.00727272
Iteration 6: 0.00368118
Iteration 7: 0.00259113
Iteration 8: 0.00118828
Iteration 9: 0.000779152
Iteration 10: 0.000844955
Iteration 11: 0.000470161
Generating best alignments...
Generating symbols...
Compiling symbols into FAR archive...
Counting n-grams...
Smoothing model...
Minimizing model...
Correcting final model...
Writing text model to disk...
Writing binary model to disk...
test@linux:~/cmusphinx/dict>

The model can then be evaluated from the command line as follows:

test@linux:~/cmusphinx/dict> /usr/local/lib64/sphinxtrain/scripts/0000.g2p_train/evaluate.py /usr/local/lib/sphinxtrain/ model.fst model.test eval
INFO: cmd_ln.c(691): Parsing command line:
/usr/local/lib/sphinxtrain/phonetisaurus-g2p \
-model /home/test/cmusphinx/an4/g2p/an4.fst \
-input /home/test/cmusphinx/an4/g2p/an4.words \
-beam 1500 \
-words yes \
-isfile yes \
-output_cost yes \
-output /home/test/cmusphinx/an4/g2p/an4.hyp
Current configuration:
[NAME]  [DEFLT] [VALUE]
-beam  500 1500
-help  no  no
-input     /home/test/cmusphinx/an4/g2p/an4.words
-isfile    no yes
-model     /home/test/cmusphinx/an4/g2p/an4.fst
-nbest  1  1
-output     /home/test/cmusphinx/an4/g2p/an4.hyp
-output_cost  no yes
-sep
-words  no yes
Words: 12946  Hyps: 12946 Refs: 12946
(T)otal tokens in reference: 82416
(M)atches: 74816  (S)ubstitutions: 6906  (I)nsertions: 1076  (D)eletions: 694
% Correct (M/T)           -- %90.78
% Token ER ((S+I+D)/T)    -- %10.53
% Accuracy 1.0-ER         -- %89.47
(S)equences: 12946  (C)orrect sequences: 7859  (E)rror sequences: 5087
% Sequence ER (E/S)       -- %39.29
% Sequence Acc (1.0-E/S)  -- %60.71
test@linux:~/cmusphinx/dict>

3. Using a g2p model in Sphinx-4

To use the OpenFst text-format model trained in the previous section (located in the /home/test/cmusphinx/dict directory) in Sphinx-4, it first needs to be converted to the Java fst binary format, as follows:

test@linux:~/cmusphinx/dict> cd ../fst/
test@linux:~/cmusphinx/fst> ./openfst2java.sh ../dict/model ../sphinx4/models/model.fst.ser

and to use it in an application, add the following lines to the dictionary component in the configuration file:

        <property name="allowMissingWords" value="true"/>
        <property name="createMissingWords" value="true"/>
        <property name="g2pModelPath" value="file:///home/test/cmusphinx/sphinx4/models/model.fst.ser"/>
        <property name="g2pMaxPron" value="2"/>

Notice that the "wordReplacement" property should not be present in the dictionary component. The "g2pModelPath" property should contain a URI pointing to the g2p model in Java fst format. The "g2pMaxPron" property holds the number of different pronunciations generated by the g2p decoder for each word. More information about Sphinx-4 configuration can be found at [7].
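
Put together, a dictionary component using the g2p model might look like the following sketch. The component name, the dictionary class, and the dictionaryPath/fillerPath values are placeholders; keep your application's existing dictionary definition and only add the g2p-related properties:

```xml
<component name="dictionary"
           type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
    <!-- placeholders: point these at your own dictionaries -->
    <property name="dictionaryPath" value="file:///path/to/your.dic"/>
    <property name="fillerPath" value="file:///path/to/your.filler"/>
    <!-- g2p-related properties described above -->
    <property name="allowMissingWords" value="true"/>
    <property name="createMissingWords" value="true"/>
    <property name="g2pModelPath"
              value="file:///home/test/cmusphinx/sphinx4/models/model.fst.ser"/>
    <property name="g2pMaxPron" value="2"/>
</component>
```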

Conclusion
This article tried to summarize the recent changes related to the new grapheme-to-phoneme (g2p) feature in the CMU Sphinx-4 speech recognizer, from a user's perspective. Future articles will present the API of the new Java fst framework created for the g2p feature, followed by a detailed performance review of the g2p decoder and the Java fst framework in general.
As a suggestion for future work, it would be interesting to evaluate the g2p decoder in an automatic speech recognition context, since the measure of pronunciation variants that are correctly produced, and the number of incorrect variants generated, might not be directly related to the quality of the generated pronunciation variants when used in automatic speech recognition [8].

References
[1] “GSoC 2012: Letter to Phoneme Conversion in CMU Sphinx-4”
[2] CMUSphinx Home Page
[3] Java Fst Framework
[4] OpenFst Library Home Page
[5] OpenGrm NGram Library
[6] Training Acoustic Model For CMUSphinx
[7] Sphinx-4 Application Programmer’s Guide
[8] D. Jouvet, D. Fohr, I. Illina, “Evaluating Grapheme-to-Phoneme Converters in Automatic Speech Recognition Context”, IEEE International Conference on Acoustics, Speech, and Signal Processing, March 2012, Japan
