Using OpenGrm NGram Library for the encoding of joint multigram language models as WFST

June 9, 2012 by John Salatas

(originally posted at http://cmusphinx.sourceforge.net/2012/06/using-opengrm-ngram-library-for-the-encoding-of-joint-multigram-language-models-as-wfst/)

Foreword
This article reviews the OpenGrm NGram Library [1] and its usage for language modeling in ASR. OpenGrm makes use of functionality in the OpenFst library [2] to create, access and manipulate n-gram language models, and it can be used as the language model training toolkit for integrating phonetisaurus’ model training procedures [3] into a simplified application.

1. Model Training
Having the aligned corpus produced from cmudict by the aligner code of our previous article [4], the first step is to generate an OpenFst-style symbol table for the text tokens in the input corpus. This can be done with [5]:

# ngramsymbols < cmudict.corpus > cmudict.syms
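
The resulting symbol table is a plain-text file that maps each token of the aligned corpus to an integer id; it can be quickly inspected with (a sanity check, not part of the original procedure):

# head -n 5 cmudict.syms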

Given the symbol table, a text corpus can be converted to a binary finite state archive (far) [6] with:

# farcompilestrings --symbols=cmudict.syms --keep_symbols=1 cmudict.corpus > cmudict.far
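
The archive contents can be verified by printing the stored strings back out with farprintstrings, one of the FAR tools shipped with OpenFst (again just a sanity check); the output should match the first lines of cmudict.corpus:

# farprintstrings cmudict.far | head -n 5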

The next step is to count n-grams from the input corpus, now in FAR format; this produces an n-gram model in the FST format. The --order switch selects the maximum n-gram length to count.

The 1-gram through 9-gram counts for the cmudict.far finite-state archive created above can be generated with:

# ngramcount --order=9 cmudict.far > cmudict.cnts
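
Basic statistics of the produced counts, such as the model order and the number of n-grams and states, can be printed with OpenGrm’s ngraminfo tool:

# ngraminfo cmudict.cnts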

Finally, the 9-gram counts in cmudict.cnts can be converted to a WFST model with:

# ngrammake --method="kneser_ney" cmudict.cnts > cmudict.mod

The --method option selects the smoothing method [7] from one of the six available:

  • witten_bell: smooths using Witten-Bell [8], with a hyperparameter k, as presented in [9].
  • absolute: smooths based on Absolute Discounting [10], using bins and discount parameters.
  • katz: smooths based on Katz Backoff [11], using bins parameters.
  • kneser_ney: smooths based on Kneser-Ney [12], a variant of Absolute Discounting.
  • presmoothed: normalizes at each state based on the n-gram count of the history.
  • unsmoothed: normalizes the model but provides no smoothing.
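
Since all methods start from the same counts file, alternative models are cheap to build for comparison, e.g. a Witten-Bell smoothed variant (a sketch; the output file name is illustrative):

# ngrammake --method="witten_bell" cmudict.cnts > cmudict.wb.mod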

2. Evaluation – Comparison with phonetisaurus

In order to evaluate the OpenGrm models, the procedure described above was repeated using the standard 90%-10% split of cmudict into a training and a test set respectively. The binary FST format produced by ngrammake wasn’t readable by the phonetisaurus evaluation script, so it was converted to the ARPA format with:

# ngramprint --ARPA cmudict.mod > cmudict.arpa

and then back to a phonetisaurus binary FST with:

# phonetisaurus-arpa2fst --input=cmudict.arpa --prefix="cmudict/cmudict"
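
The intermediate ARPA file is plain text and can be sanity-checked before evaluation; its \data\ header lists the number of n-grams per order:

# head -n 15 cmudict.arpa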

Finally, the test set was evaluated with:

# evaluate.py --modelfile cmudict/cmudict.fst --testfile cmudict.dict.test --prefix cmudict/cmudict
Words: 13328 Hyps: 13328 Refs: 13328
##############################################
EVALUATION RESULTS
---------------------------------------------------------------------
(T)otal tokens in reference: 84955
(M)atches: 77165 (S)ubstitutions: 7044 (I)nsertions: 654 (D)eletions: 746
% Correct (M/T) -- %90.83
% Token ER ((S+I+D)/T) -- %9.94
% Accuracy 1.0-ER -- %90.06
---------------------------------------------------------------------
(S)equences: 13328 (C)orrect sequences: 8010 (E)rror sequences: 5318
% Sequence ER (E/S) -- %39.90
% Sequence Acc (1.0-E/S) -- %60.10
##############################################
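
As a sanity check, the reported rates follow directly from the token counts above: Token ER = (S+I+D)/T = (7044+654+746)/84955 = 8444/84955 ≈ 0.0994, i.e. the printed 9.94%, and % Correct = M/T = 77165/84955 ≈ 90.83%.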

3. Conclusion – Future Work

The evaluation results above are, as expected, almost identical to those produced by the phonetisaurus procedure, which uses the MITLM toolkit instead of OpenGrm.

With the above procedure in place, the next step is to integrate all of these commands into a simplified application, combined with the dictionary alignment code introduced in our previous article [13].
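
As a first step in that direction, the commands of this article can already be chained in a small shell script (a sketch using exactly the commands above; it assumes that the aligned cmudict.corpus produced by the aligner of [4] and the cmudict/ output directory already exist):

#!/bin/sh
set -e

# 1. Symbol table and FAR compilation
ngramsymbols < cmudict.corpus > cmudict.syms
farcompilestrings --symbols=cmudict.syms --keep_symbols=1 cmudict.corpus > cmudict.far

# 2. Count n-grams and build the smoothed model
ngramcount --order=9 cmudict.far > cmudict.cnts
ngrammake --method="kneser_ney" cmudict.cnts > cmudict.mod

# 3. Convert to ARPA and then to a phonetisaurus binary FST
ngramprint --ARPA cmudict.mod > cmudict.arpa
phonetisaurus-arpa2fst --input=cmudict.arpa --prefix="cmudict/cmudict"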

References

[1] OpenGrm NGram Library

[2] OpenFst library

[3] “Phonetisaurus: A WFST-driven Phoneticizer – Framework Review”, ICT Research Blog, May 2012.

[4] “Porting phonetisaurus many-to-many alignment python script to C++”, ICT Research Blog, May 2012.

[5] OpenGrm NGram Library Quick Tour

[6] OpenFst Extensions: FST Archives (FARs)

[7] S. F. Chen, J. Goodman, “An Empirical Study of Smoothing Techniques for Language Modeling”, Harvard Computer Science Technical Report TR-10-98, 1998.

[8] I. H. Witten, T. C. Bell, “The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression”, IEEE Transactions on Information Theory 37 (4), pp. 1085–1094, 1991.

[9] B. Carpenter, “Scaling high-order character language models to gigabytes”, Proceedings of the ACL Workshop on Software, pp. 86–99, 2005.

[10] H. Ney, U. Essen, R. Kneser, “On structuring probabilistic dependences in stochastic language modeling”, Computer Speech and Language 8, pp. 1–38, 1994.

[11] S. M. Katz, “Estimation of probabilities from sparse data for the language model component of a speech recognizer”, IEEE Transactions on Acoustics, Speech, and Signal Processing 35 (3), pp. 400–401, 1987.

[12] R. Kneser, H. Ney, “Improved backing-off for m-gram language modeling”, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 181–184, 1995.

[13] “Porting phonetisaurus many-to-many alignment python script to C++”, ICT Research Blog, May 2012.

4 thoughts on “Using OpenGrm NGram Library for the encoding of joint multigram language models as WFST”

  1. Robert E. Coifman, M. D.

    I am the senior developer of a group of concepts to reduce transcription errors in computer recognition of free-text dictation by familiar system users (such as physicians) into fields of forms in databases (such as electronic medical records). I have a good qualitative but not quantitative sense of what it takes to accomplish what we want to do. I have two very bright young computer-savvy associates, but none of us has extensive working experience with speech recognition coding, and each of us is limited in the time available outside our day jobs (mine being a solo medical practice) to acquire these skills.

    None of the available closed source speech applications allow the software access needed to apply our concepts, and the consensus of my two partners is that Sphinx is the best platform for this purpose.

    I don’t want to put all our ideas out for public display but would welcome an opportunity to discuss them with you in a less public way.

    Can you give me a way to reach you by email, with the idea of talking by phone if you find our concepts interesting?

    Thanks.

    Sincerely,

    Robert E. Coifman, M. D.

    1. John Salatas (post author)

      Hello Robert,

      I have just sent you an email. You can reply to that with more info.

      Looking forward to hearing from you soon.

      Best Regards,

      John Salatas

  2. Stanislav Jan Sarman

    Dear Mr. Salatas,
    I’m trying to build my first language model using OpenGrm. Where can I find cmudict.corpus?
    Thank you,
    S. J. Sarman

