Using the grapheme-to-phoneme feature in CMU Sphinx-4

August 13, 2012

Foreword

This article summarizes and updates the previous articles [1] related to the new grapheme-to-phoneme (g2p) feature in the CMU Sphinx-4 speech recognizer [2].

To support automatic g2p transcription in Sphinx-4, a new weighted finite-state transducer (WFST) framework was created in Java [3]; its current API will be presented in a future article. Various new applications were also created, whose installation procedure and usage are presented in the following sections.
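
Conceptually, a g2p model is a weighted finite-state transducer that maps grapheme sequences to phoneme sequences, with weights scoring competing pronunciations. The following toy Python sketch illustrates the idea only; it is not the API of the Java fst framework (which a future article will cover), and the states, arcs and weights are made up:

```python
# Toy weighted finite-state transducer (WFST): maps a grapheme string
# to a phoneme sequence while summing arc weights (lower cost = better).
# Illustrative sketch only -- not the CMU Sphinx Java fst API.

# Arcs: (state, input grapheme) -> (next state, output phoneme, weight)
ARCS = {
    (0, "p"): (1, "P", 0.1),
    (1, "h"): (2, "F", 0.4),   # "ph" is pronounced F
    (1, "o"): (2, "OW", 0.2),
    (2, "o"): (3, "OW", 0.2),
}
FINAL_STATES = {2, 3}

def transduce(graphemes):
    """Follow the (deterministic) transducer; return (phonemes, cost)."""
    state, phonemes, cost = 0, [], 0.0
    for g in graphemes:
        state, phoneme, weight = ARCS[(state, g)]
        phonemes.append(phoneme)
        cost += weight
    if state not in FINAL_STATES:
        raise ValueError("input not accepted by the transducer")
    return phonemes, cost
```

For example, transduce("pho") yields the phoneme sequence ['P', 'F', 'OW'] with a total cost of about 0.7. A real g2p WFST is far larger and non-deterministic, and decoding searches for the lowest-cost paths.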

The procedures presented here were verified using openSUSE 12.1 x64 in a VirtualBox machine, but should apply to all recent Linux distributions (either 32- or 64-bit). They assume that you are logged in as user test and that all required software is saved under the /home/test/cmusphinx directory. As a final note, the output of the various commands is omitted in this article, but it should be watched for errors or useful information, especially when troubleshooting.

1. Installation

1.1. Required 3rd party libraries and applications

The following third-party libraries should be installed on your system before installing and running the main applications. Note that they are only required in order to train new g2p models; they are not required if you just want to use an existing g2p model in Sphinx-4.

1.1.1. OpenFst

OpenFst [4] is a library written in C++ for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). You can download the latest version available at [4]. This article uses version 1.3.2.

test@linux:~/cmusphinx> wget http://www.openfst.org/twiki/pub/FST/FstDownload/openfst-1.3.2.tar.gz
...
test@linux:~/cmusphinx> tar -xzf openfst-1.3.2.tar.gz
test@linux:~/cmusphinx> cd openfst-1.3.2/
test@linux:~/cmusphinx/openfst-1.3.2> ./configure --enable-compact-fsts --enable-const-fsts --enable-far --enable-lookahead-fsts --enable-pdt
...
test@linux:~/cmusphinx/openfst-1.3.2> make
...
test@linux:~/cmusphinx/openfst-1.3.2> sudo make install
...
test@linux:~/cmusphinx/openfst-1.3.2> cd ..

1.1.2. OpenGrm NGram

The OpenGrm NGram library [5] is used for making and modifying n-gram language models encoded as weighted finite-state transducers (FSTs). It makes use of functionality in the OpenFst library to create, access and manipulate n-gram models. You can download the latest version available at [5]. This article uses version 1.0.3.

test@linux:~/cmusphinx> wget http://www.openfst.org/twiki/pub/GRM/NGramDownload/opengrm-ngram-1.0.3.tar.gz
...
test@linux:~/cmusphinx> tar -xzf opengrm-ngram-1.0.3.tar.gz
test@linux:~/cmusphinx> cd opengrm-ngram-1.0.3/
test@linux:~/cmusphinx/opengrm-ngram-1.0.3> ./configure
...
test@linux:~/cmusphinx/opengrm-ngram-1.0.3> make
...

If the make command fails to complete on a 64-bit operating system, re-execute the configure command with the library path set explicitly and then run make again:

test@linux:~/cmusphinx/opengrm-ngram-1.0.3> ./configure LDFLAGS=-L/usr/local/lib64/fst
...
test@linux:~/cmusphinx/opengrm-ngram-1.0.3> make
...
test@linux:~/cmusphinx/opengrm-ngram-1.0.3> sudo make install
...
test@linux:~/cmusphinx/opengrm-ngram-1.0.3> cd ..
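
To give an intuition for what the OpenGrm step contributes during g2p training, the core of an n-gram model is counting n-grams over token sequences (here, aligned grapheme/phoneme chunks). Below is a short Python sketch of plain bigram counting; it illustrates the idea only and is not OpenGrm's API:

```python
from collections import Counter

def ngram_counts(tokens, order):
    """Count n-grams of the given order, padding with the sentence
    markers (<s>, </s>) that n-gram toolkits conventionally use."""
    padded = ["<s>"] * (order - 1) + list(tokens) + ["</s>"]
    return Counter(tuple(padded[i:i + order])
                   for i in range(len(padded) - order + 1))

# Bigrams over one (made-up) phoneme sequence.
counts = ngram_counts(["P", "F", "OW"], 2)
```

OpenGrm then smooths such counts (Kneser-Ney by default in the training run shown later) and encodes the resulting model as an FST.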

1.2. Main applications

1.2.1. SphinxTrain

With the OpenFst and OpenGrm libraries installed, a new g2p model can be trained in SphinxTrain while training a new acoustic model [6].

test@linux:~/cmusphinx> svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/sphinxbase
...
test@linux:~/cmusphinx> svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx
...
test@linux:~/cmusphinx> svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/SphinxTrain
...
test@linux:~/cmusphinx> cd sphinxbase/
test@linux:~/cmusphinx/sphinxbase> ./autogen.sh
...
test@linux:~/cmusphinx/sphinxbase> make
...
test@linux:~/cmusphinx/sphinxbase> sudo make install
...
test@linux:~/cmusphinx/sphinxbase> cd ../pocketsphinx/
test@linux:~/cmusphinx/pocketsphinx> ./autogen.sh
...
test@linux:~/cmusphinx/pocketsphinx> make
...
test@linux:~/cmusphinx/pocketsphinx> sudo make install
...
test@linux:~/cmusphinx/pocketsphinx> cd ../SphinxTrain/
test@linux:~/cmusphinx/SphinxTrain> ./autogen.sh --enable-g2p-decoder
...
test@linux:~/cmusphinx/SphinxTrain> make
...
test@linux:~/cmusphinx/SphinxTrain> sudo make install
...
test@linux:~/cmusphinx/SphinxTrain>

1.2.2. Sphinx-4

The g2p decoding functionality was introduced in SVN revision 11556. In addition to sphinx-4, you also need to check out the latest revision of the Java fst framework:

test@linux:~/cmusphinx> svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/branches/g2p/fst
...
test@linux:~/cmusphinx> svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/sphinx4
...
test@linux:~/cmusphinx> cd fst
test@linux:~/cmusphinx/fst> ant jar
...
test@linux:~/cmusphinx/fst> cp dist/fst.jar ../sphinx4/lib/
test@linux:~/cmusphinx/fst> cd ../sphinx4/lib/
test@linux:~/cmusphinx/sphinx4/lib> ./jsapi.sh
...
test@linux:~/cmusphinx/sphinx4/lib> cd ..
test@linux:~/cmusphinx/sphinx4> ant
...
test@linux:~/cmusphinx/sphinx4>

2. Training a g2p model

2.1. Training through SphinxTrain

Following the instructions for training an acoustic model found at [6] can also train a g2p model.

In addition to [6], for the current revision of SphinxTrain (11554), after running the sphinxtrain -t an4 setup command you need to enable the g2p functionality by setting the $CFG_G2P_MODEL variable in the generated configuration file (etc/sphinx_train.cfg) to

$CFG_G2P_MODEL= 'yes';

Running sphinxtrain run according to [6] will then produce output related to the g2p model training similar to the following:
...
MODULE: 0000 train grapheme-to-phoneme model
Phase 1: Cleaning up directories: logs...
Phase 2: Training g2p model...
Phase 3: Evaluating g2p model...
INFO: cmd_ln.c(691): Parsing command line:
/usr/local/lib/sphinxtrain/phonetisaurus-g2p \
-model /home/test/cmusphinx/an4/g2p/an4.fst \
-input /home/test/cmusphinx/an4/g2p/an4.words \
-beam 1500 \
-words yes \
-isfile yes \
-output_cost yes \
-output /home/test/cmusphinx/an4/g2p/an4.hyp
Current configuration:
[NAME]  [DEFLT] [VALUE]
-beam  500 1500
-help  no  no
-input     /home/test/cmusphinx/an4/g2p/an4.words
-isfile    no yes
-model     /home/test/cmusphinx/an4/g2p/an4.fst
-nbest  1  1
-output     /home/test/cmusphinx/an4/g2p/an4.hyp
-output_cost  no yes
-sep
-words  no yes
Words: 13  Hyps: 13 Refs: 13
(T)otal tokens in reference: 50
(M)atches: 46  (S)ubstitutions: 1  (I)nsertions: 4  (D)eletions: 3
% Correct (M/T)           -- %92.00
% Token ER ((S+I+D)/T)    -- %16.00
% Accuracy 1.0-ER         -- %84.00
(S)equences: 13  (C)orrect sequences: 8  (E)rror sequences: 5
% Sequence ER (E/S)       -- %38.46
% Sequence Acc (1.0-E/S)  -- %61.54
Phase 4: Creating pronunciations for OOV words...
Phase 5: Merging primary and OOV dictionaries...
...
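
The error-rate figures in the evaluation log above follow directly from the printed token and sequence counts; the arithmetic, spelled out in Python:

```python
# Counts as printed by the evaluation phase above.
T = 50                      # (T)otal tokens in reference
M, S, I, D = 46, 1, 4, 3    # matches, substitutions, insertions, deletions
E, SEQ = 5, 13              # error sequences, total sequences

percent_correct = 100.0 * M / T              # 92.00
token_error_rate = 100.0 * (S + I + D) / T   # 16.00
accuracy = 100.0 - token_error_rate          # 84.00
sequence_error_rate = 100.0 * E / SEQ        # 38.46...
```

Note that matches, substitutions and deletions together account for all reference tokens (46 + 1 + 3 = 50), while insertions are extra hypothesis tokens, which is why the token error rate can exceed 100 minus %Correct.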

The training process generates an additional dictionary for the training-transcription words not found in the main dictionary, and creates pronunciations for them using the trained g2p model. After the training is completed, the model can be found under the g2p directory. The OpenFst binary model is the an4.fst file. If you plan to convert it to a Java binary model and use it in Sphinx-4, you also need the OpenFst text format, which consists of the main model file (an4.fst.txt) and two additional symbol files (an4.input.syms and an4.output.syms).
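
The OpenFst text format mentioned above is the plain AT&T arc list: each non-empty line is either an arc, `src dst isym osym [weight]`, or a final state, `state [weight]`. A minimal Python sketch of reading it (the example arcs are made up, not actual an4 model content):

```python
def parse_fst_text(lines):
    """Parse OpenFst (AT&T) text format into (arcs, final_states)."""
    arcs, finals = [], {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if len(fields) >= 4:        # arc: src dst isym osym [weight]
            src, dst, isym, osym = fields[:4]
            weight = float(fields[4]) if len(fields) > 4 else 0.0
            arcs.append((int(src), int(dst), isym, osym, weight))
        else:                       # final state: state [weight]
            finals[int(fields[0])] = (
                float(fields[1]) if len(fields) > 1 else 0.0)
    return arcs, finals

# Made-up fragment resembling a g2p model in text format.
arcs, finals = parse_fst_text([
    "0 1 a AH 0.5",
    "1 2 b B 0.25",
    "2",
])
```

The symbol files (an4.input.syms, an4.output.syms) simply map such textual labels to the integer IDs used in the binary model, one symbol/ID pair per line.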

2.2. Training using the standalone application

Although the default training parameters provide a relatively low Word Error Rate (WER), it may be possible to fine-tune the various parameters further and produce a model with an even lower WER. Assuming that the directory /home/test/cmusphinx/dict contains the cmudict.0.6d dictionary, a model can be built directly from the command line as follows:

test@linux:~/cmusphinx/dict> export PATH=/usr/local/lib/sphinxtrain/:$PATH
test@linux:~/cmusphinx/dict> g2p_train -ifile cmudict.0.6d
INFO: cmd_ln.c(691): Parsing command line:
g2p_train \
-ifile cmudict.0.6d
Current configuration:
[NAME]          [DEFLT]         [VALUE]
-eps            <eps>           <eps>
-gen_testset    yes             yes
-help           no              no
-ifile                          cmudict.0.6d
-iter           10              10
-noalign        no              no
-order          6               6
-pattern
-prefix         model           model
-prune          no              no
-s1s2_delim
-s1s2_sep       }               }
-seq1in_sep
-seq1_del       no              no
-seq1_max       2               2
-seq2in_sep
-seq2_del       no              no
-seq2_max       2               2
-seq_sep        |               |
-skip           _               _
-smooth         kneser_ney      kneser_ney
-theta          0               0.000000e+00
Splitting dictionary: cmudict.0.6d into training and test set
Splitting...
Using dictionary: model.train
Loading...
Starting EM...
Iteration 1: 1.23585
Iteration 2: 0.176181
Iteration 3: 0.0564651
Iteration 4: 0.0156775
Iteration 5: 0.00727272
Iteration 6: 0.00368118
Iteration 7: 0.00259113
Iteration 8: 0.00118828
Iteration 9: 0.000779152
Iteration 10: 0.000844955
Iteration 11: 0.000470161
Generating best alignments...
Generating symbols...
Compiling symbols into FAR archive...
Counting n-grams...
Smoothing model...
Minimizing model...
Correcting final model...
Writing text model to disk...
Writing binary model to disk...
test@linux:~/cmusphinx/dict>

The model can then be evaluated from the command line as follows:

test@linux:~/cmusphinx/dict> /usr/local/lib64/sphinxtrain/scripts/0000.g2p_train/evaluate.py /usr/local/lib/sphinxtrain/ model.fst model.test eval
INFO: cmd_ln.c(691): Parsing command line:
/usr/local/lib/sphinxtrain/phonetisaurus-g2p \
-model /home/test/cmusphinx/an4/g2p/an4.fst \
-input /home/test/cmusphinx/an4/g2p/an4.words \
-beam 1500 \
-words yes \
-isfile yes \
-output_cost yes \
-output /home/test/cmusphinx/an4/g2p/an4.hyp
Current configuration:
[NAME]  [DEFLT] [VALUE]
-beam  500 1500
-help  no  no
-input     /home/test/cmusphinx/an4/g2p/an4.words
-isfile    no yes
-model     /home/test/cmusphinx/an4/g2p/an4.fst
-nbest  1  1
-output     /home/test/cmusphinx/an4/g2p/an4.hyp
-output_cost  no yes
-sep
-words  no yes
Words: 12946  Hyps: 12946 Refs: 12946
(T)otal tokens in reference: 82416
(M)atches: 74816  (S)ubstitutions: 6906  (I)nsertions: 1076  (D)eletions: 694
% Correct (M/T)           -- %90.78
% Token ER ((S+I+D)/T)    -- %10.53
% Accuracy 1.0-ER         -- %89.47
(S)equences: 12946  (C)orrect sequences: 7859  (E)rror sequences: 5087
% Sequence ER (E/S)       -- %39.29
% Sequence Acc (1.0-E/S)  -- %60.71
test@linux:~/cmusphinx/dict>

3. Using a g2p model in Sphinx-4

To use the OpenFst text-format model trained in the previous section (located in the /home/test/cmusphinx/dict directory) in Sphinx-4, it first needs to be converted to the Java fst binary format, as follows:

test@linux:~/cmusphinx/dict> cd ../fst/
test@linux:~/cmusphinx/fst> ./openfst2java.sh ../dict/model ../sphinx4/models/model.fst.ser

and to use it in an application, add the following lines to the dictionary component in the configuration file:

        <property name="allowMissingWords" value="true"/>
        <property name="createMissingWords" value="true"/>
        <property name="g2pModelPath" value="file:///home/test/cmusphinx/sphinx4/models/model.fst.ser"/>
        <property name="g2pMaxPron" value="2"/>

Notice that the "wordReplacement" property should not be present in the dictionary component. The "g2pModelPath" property should contain a URI pointing to the g2p model in Java fst format. The "g2pMaxPron" property holds the number of different pronunciations generated by the g2p decoder for each word. More information about Sphinx-4 configuration can be found at [7].
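
Put together, a dictionary component using the g2p model might look like the following sketch. The component name, the dictionary class, and the dictionaryPath/fillerPath values are placeholders; keep your application's existing dictionary definition and only add the g2p-related properties:

```xml
<component name="dictionary"
           type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
    <!-- placeholders: point these at your own dictionaries -->
    <property name="dictionaryPath" value="file:///path/to/your.dic"/>
    <property name="fillerPath" value="file:///path/to/your.filler"/>
    <!-- g2p-related properties described above -->
    <property name="allowMissingWords" value="true"/>
    <property name="createMissingWords" value="true"/>
    <property name="g2pModelPath"
              value="file:///home/test/cmusphinx/sphinx4/models/model.fst.ser"/>
    <property name="g2pMaxPron" value="2"/>
</component>
```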

Conclusion
This article tried to summarize the recent changes related to the new grapheme-to-phoneme (g2p) feature in the CMU Sphinx-4 speech recognizer, from a user's perspective. Future articles will present the API of the new Java fst framework created for the g2p feature, followed by a detailed performance review of the g2p decoder and the Java fst framework in general.
As a suggestion for future work, it would be interesting to evaluate the g2p decoder in an automatic speech recognition context, since the measure of pronunciation variants that are correctly produced, and the number of incorrect variants generated, might not be directly related to the quality of the generated pronunciation variants when used in automatic speech recognition [8].

References
[1] “GSoC 2012: Letter to Phoneme Conversion in CMU Sphinx-4”
[2] CMUSphinx Home Page
[3] Java Fst Framework
[4] OpenFst Library Home Page
[5] OpenGrm NGram Library
[6] Training Acoustic Model For CMUSphinx
[7] Sphinx-4 Application Programmer’s Guide
[8] D. Jouvet, D. Fohr, I. Illina, “Evaluating Grapheme-to-Phoneme Converters in Automatic Speech Recognition Context”, IEEE International Conference on Acoustics, Speech, and Signal Processing, March 2012, Japan
