Monthly Archives: June 2012

Automating the creation of joint multigram language models as WFST: Part 2

(originally posted at http://cmusphinx.sourceforge.net/2012/06/automating-the-creation-of-joint-multigram-language-models-as-wfst-part-2/)

Foreword

This article presents an updated version of the model training application originally discussed in [1], taking into account the compatibility issues with the phonetisaurus decoder presented in [2]. The updated code introduces routines to regenerate a binary fst model compatible with phonetisaurus' decoder, as suggested in [2]; these routines are reviewed in the next section.

1. Code review

The basic code for the model regeneration is defined in train.cpp, in the procedure

void relabel(StdMutableFst *fst, StdMutableFst *out, string eps, string skip, string s1s2_sep, string seq_sep);

where fst and out are the input and the output (regenerated) models, respectively.

The first initialization step is to generate the new input, output and state SymbolTables and to add the new start and final states to the output model [2].

Furthermore, in this step the SymbolTables are initialized. Phonetisaurus' decoder requires the symbols eps, seq_sep, "<phi>", "<s>" and "</s>" to be mapped to the keys 0, 1, 2, 3 and 4, respectively.

void relabel(StdMutableFst *fst, StdMutableFst *out, string eps, string skip, string s1s2_sep, string seq_sep) {
	ArcSort(fst, StdILabelCompare());
	const SymbolTable *oldsyms = fst->InputSymbols();
 
	// Uncomment the next line in order to save the original model
	// as created by ngram
	// fst->Write("org.fst");
	// generate new input, output and states SymbolTables
	SymbolTable *ssyms = new SymbolTable("ssyms");
	SymbolTable *isyms = new SymbolTable("isyms");
	SymbolTable *osyms = new SymbolTable("osyms");
 
	out->AddState();
	ssyms->AddSymbol("s0");
	out->SetStart(0);
 
	out->AddState();
	ssyms->AddSymbol("f");
	out->SetFinal(1, TropicalWeight::One());
 
	isyms->AddSymbol(eps);
	osyms->AddSymbol(eps);
 
	//Add separator, phi, start and end symbols
	isyms->AddSymbol(seq_sep);
	osyms->AddSymbol(seq_sep);
	isyms->AddSymbol("<phi>");
	osyms->AddSymbol("<phi>");
	int istart = isyms->AddSymbol("<s>");
	int iend = isyms->AddSymbol("</s>");
	int ostart = osyms->AddSymbol("<s>");
	int oend = osyms->AddSymbol("</s>");
 
	out->AddState();
	ssyms->AddSymbol("s1");
	out->AddArc(0, StdArc(istart, ostart, TropicalWeight::One(), 2));
	...
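
Since these key assignments are easy to get wrong, a quick sanity check of the regenerated model can help. The following is a minimal sketch that assumes OpenFst's SymbolTable API; the model file name and the exact symbol spellings (which depend on the eps and seq_sep values passed to the training application) are illustrative assumptions:

#include <iostream>
#include <fst/fstlib.h>

using namespace fst;

int main() {
	// Load the regenerated model (file name is illustrative).
	StdFst *model = StdFst::Read("cmudict.fst");
	const SymbolTable *isyms = model->InputSymbols();
	// Assumed spellings for keys 0..4; adjust to your training options.
	const char *expected[] = { "<eps>", "|", "<phi>", "<s>", "</s>" };
	for (int64 key = 0; key < 5; ++key)
		std::cout << key << " -> '" << isyms->Find(key) << "' (expected '"
			<< expected[key] << "')" << std::endl;
	return 0;
}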

In the main step, the code iterates through each state of the input model and adds it to the output model, keeping track of the mapping between old and new state ids in the ssyms SymbolTable.

In order to transform the output model into one with a single final state [2], the code checks whether the current state is final and, if it is, adds a new arc connecting the current state to the single final state (state_id 1), with label "</s>:</s>" and weight equal to the current state's final weight. It also sets the final weight of the current state to TropicalWeight::Zero() (i.e. it converts the current state to a non-final one).

	...
	for (StateIterator<StdFst> siter(*fst); !siter.Done(); siter.Next()) {
		StateId state_id = siter.Value();
 
		int64 newstate;
		if (state_id == fst->Start()) {
			newstate = 2;
		} else {
			newstate = ssyms->Find(convertInt(state_id));
			if(newstate == -1 ) {
				out->AddState();
				ssyms->AddSymbol(convertInt(state_id));
				newstate = ssyms->Find(convertInt(state_id));
			}
		}
 
		TropicalWeight weight = fst->Final(state_id);
 
		if (weight != TropicalWeight::Zero()) {
			// this is a final state
			StdArc a = StdArc(iend, oend, weight, 1);
			out->AddArc(newstate, a);
			out->SetFinal(newstate, TropicalWeight::Zero());
		}
		addarcs(state_id, newstate, oldsyms, isyms, osyms, ssyms, eps, s1s2_sep, fst, out);
	}
	out->SetInputSymbols(isyms);
	out->SetOutputSymbols(osyms);
	ArcSort(out, StdOLabelCompare());
	ArcSort(out, StdILabelCompare());
}

Lastly, the addarcs procedure is called in order to relabel the arcs of each state of the input model and add them to the output model. It also creates any missing states (i.e. the missing next states of an arc).

void addarcs(StateId state_id, StateId newstate, const SymbolTable* oldsyms, SymbolTable* isyms,
		SymbolTable* osyms,	SymbolTable* ssyms,	string eps,	string s1s2_sep, StdMutableFst *fst,
		StdMutableFst *out) {
	for (ArcIterator<StdFst> aiter(*fst, state_id); !aiter.Done(); aiter.Next()) {
		StdArc arc = aiter.Value();
		string oldlabel = oldsyms->Find(arc.ilabel);
		if(oldlabel == eps) {
			// Special case: a plain eps label becomes "eps}eps" so that
			// the split below yields eps for both input and output.
			oldlabel = oldlabel.append("}");
			oldlabel = oldlabel.append(eps);
		}
		vector<string> tokens;
		split_string(&oldlabel, &tokens, &s1s2_sep, true);
		int64 ilabel = isyms->AddSymbol(tokens.at(0));
		int64 olabel = osyms->AddSymbol(tokens.at(1));
 
		int64 nextstate = ssyms->Find(convertInt(arc.nextstate));
		if(nextstate  == -1 ) {
			out->AddState();
			ssyms->AddSymbol(convertInt(arc.nextstate));
			nextstate = ssyms->Find(convertInt(arc.nextstate));
		}
		// Replace zero (infinite) arc weights with One() before adding the arc.
		out->AddArc(newstate, StdArc(ilabel, olabel, (arc.weight != TropicalWeight::Zero())?arc.weight:TropicalWeight::One(), nextstate));
		//out->AddArc(newstate, StdArc(ilabel, olabel, arc.weight, nextstate));
	}
}
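
To put the two procedures together, a hypothetical driver could look as follows. This is only a sketch: it assumes the relabel declaration from train.cpp is visible, and the file names and separator values are illustrative (they mirror the training application's defaults):

#include <fst/fstlib.h>

using namespace fst;

int main() {
	// Read the ngram-generated model as a mutable fst
	// (convert = true converts immutable fst types on load).
	StdMutableFst *in = StdMutableFst::Read("org.fst", true);
	StdVectorFst out;
	relabel(in, &out, "<eps>", "_", "}", "|");
	out.Write("cmudict.fst");  // phonetisaurus-compatible model
	delete in;
	return 0;
}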

2. Performance – Evaluation

In order to evaluate the performance of a model generated with the new code, a new model was trained with the same dictionaries as in [4]:

# train/train --order 9 --smooth "kneser_ney" --seq1_del --seq2_del --ifile cmudict.dict.train  --ofile cmudict.fst

and evaluated with phonetisaurus' evaluate script:

# evaluate.py --modelfile cmudict.fst --testfile ../cmudict.dict.test --prefix cmudict/cmudict
Mapping to null...
Words: 13335  Hyps: 13335 Refs: 13335
######################################################################
EVALUATION RESULTS
----------------------------------------------------------------------
(T)otal tokens in reference: 84993
(M)atches: 77095  (S)ubstitutions: 7050  (I)nsertions: 634  (D)eletions: 848
% Correct (M/T)           -- %90.71
% Token ER ((S+I+D)/T)    -- %10.04
% Accuracy 1.0-ER         -- %89.96
--------------------------------------------------------
(S)equences: 13335  (C)orrect sequences: 7975  (E)rror sequences: 5360
% Sequence ER (E/S)       -- %40.19
% Sequence Acc (1.0-E/S)  -- %59.81
######################################################################
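
As a quick arithmetic check, the reported token error rate follows directly from the counts above:

Token ER = (S + I + D) / T = (7050 + 634 + 848) / 84993 = 8532 / 84993 ≈ 10.04%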

3. Conclusions – Future work

The evaluation results on the cmudict dictionary are slightly worse than those obtained with the command line procedure in [4]. Although the difference doesn't seem significant, it warrants further investigation. For that purpose, and for debugging in general, one can uncomment the line fst->Write("org.fst"); in the relabel procedure depicted in the previous section, in order to have the original binary model saved in a file called "org.fst".

The next steps will probably be to write code to load the binary model in java and to port the decoding algorithm, along with the required fst operations, to java, eventually integrating it with CMUSphinx.

References
[1] J. Salatas, “Automating the creation of joint multigram language models as WFST”, ICT Research Blog, June 2012.
[2] J. Salatas, "Compatibility issues using binary fst models generated by OpenGrm NGram Library with phonetisaurus decoder", ICT Research Blog, June 2012.
[3] fstinfo on models created with opengrm
[4] J. Salatas, "Using OpenGrm NGram Library for the encoding of joint multigram language models as WFST", ICT Research Blog, June 2012.

Compatibility issues using binary fst models generated by OpenGrm NGram Library with phonetisaurus decoder

(originally posted at http://cmusphinx.sourceforge.net/2012/06/compatibility-issues-using-binary-fst-models-generated-by-opengrm-ngram-library-with-phonetisaurus-decoder/)

Foreword

Previous articles have shown how to use the OpenGrm NGram Library for the encoding of joint multigram language models as WFST [1] and provided code that simplifies and automates the fst model training [2]. As described in [1], the binary fst models generated with the procedures in those articles are not directly usable by the phonetisaurus [3] decoder.

This article will try to describe the compatibility issues in more detail and to provide some intuition on possible solutions and workarounds.

1. Open issues

Assuming a binary fst model named cmudict.fst, generated as described in [1], trying to evaluate it with phonetisaurus' evaluate python script results in the following error:

$ ./evaluate.py --order 9 --modelfile cmudict.fst --testfile cmudict.dict.test.tabbed --prefix cmudict/cmudict
../phonetisaurus-g2p --model=cmudict.fst --input=cmudict/cmudict.words --beam=1500 --alpha=0.6500 --prec=0.8500 --ratio=0.7200 --order=9 --words --isfile  > cmudict/cmudict.hyp
Symbol: 'A' not found in input symbols table.
Mapping to null...
Symbol: '4' not found in input symbols table.
Mapping to null...
Symbol: '2' not found in input symbols table.
Mapping to null...
Symbol: '1' not found in input symbols table.
Mapping to null...
Symbol: '2' not found in input symbols table.
Mapping to null...
Symbol: '8' not found in input symbols table.
Mapping to null...
sh: line 1: 18788 Segmentation fault      ../phonetisaurus-g2p --model=cmudict.fst --input=cmudict/cmudict.words --beam=1500 --alpha=0.6500 --prec=0.8500 --ratio=0.7200 --order=9 --words --isfile > cmudict/cmudict.hyp
Words: 0  Hyps: 0 Refs: 13328
Traceback (most recent call last):
File "./evaluate.py", line 124, in <module>
mbrdecode=args.mbrdecode, beam=args.beam, alpha=args.alpha, precision=args.precision, ratio=args.ratio, order=args.order
File "./evaluate.py", line 83, in evaluate_testset
PERcalculator.compute_PER_phonetisaurus( hypothesisfile, referencefile, verbose=verbose )
File "calculateER.py", line 333, in compute_PER_phonetisaurus
assert len(words)==len(hyps) and len(hyps)==len(refs)
AssertionError

2. Discussion on open issues

In order to investigate the error above, a simple aligned corpus was created as shown below:

a}a b}b
a}a
a}a c}c
a}a b}b c}c
a}a c}c
b}b c}c

This simple corpus was used as input to both the ngram and the phonetisaurus model training procedures [1], [3], and each generated fst model (binary.fst) was visualized as follows:

$ fstdraw binary.fst binary.dot
$ dot -Tps binary.dot > binary.ps

The two generated postscript outputs, using phonetisaurus and ngram respectively, are shown in the following figures:


Figure 1: Visualization of the fst model generated by phonetisaurus


Figure 2: Visualization of the fst model generated by ngram

Studying the two figures above, the first obvious conclusion is that the model generated by ngram does not distinguish between graphemes (input labels) and phonemes (output labels); it uses the joint multigram (e.g. "a}a") as both the input and the output label.

Another obvious difference between the two models concerns the start and final state(s). Phonetisaurus models have a fixed start state with a single outgoing, unweighted arc labeled <s>:<s>. Furthermore, a phonetisaurus model contains only a single final state, with </s>:</s> labels on all of its incoming arcs.
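
This difference is easy to check programmatically. The following minimal OpenFst sketch counts the final states of a model; a phonetisaurus-compatible model should report exactly one (the file name is illustrative):

#include <iostream>
#include <fst/fstlib.h>

using namespace fst;

int main() {
	StdFst *model = StdFst::Read("binary.fst");
	int finals = 0;
	// A state is final iff its final weight differs from semiring Zero.
	for (StateIterator<StdFst> siter(*model); !siter.Done(); siter.Next())
		if (model->Final(siter.Value()) != TropicalWeight::Zero())
			++finals;
	std::cout << "final states: " << finals << std::endl;
	return 0;
}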

3. Possible solutions and workarounds

[1] already presents a workaround to bypass the compatibility issues described in the previous section. We can simply export the ngram-generated binary fst model to the ARPA format with:

$ ngramprint --ARPA binary.fst > export.arpa

and then use phonetisaurus' arpa2fst conversion utility to regenerate a new binary fst model compatible with phonetisaurus' decoder:

$ phonetisaurus-arpa2fst --input=export.arpa

Although this is a straightforward procedure, a more elegant solution would be to iterate over all arcs and manually break apart the grapheme/phoneme joint multigram labels into input/output labels accordingly, as sketched below. A final step, if necessary, would be to add the start state and the single final state to the ngram-generated fst model.
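
A minimal sketch of the label-splitting idea follows; the helper name is hypothetical, and the '}' separator matches the s1s2_sep convention used in the aligned corpus:

#include <string>
#include <utility>

// Split a joint multigram label such as "a}a" into its grapheme (input)
// and phoneme (output) parts.
std::pair<std::string, std::string> split_joint(const std::string &label) {
	std::string::size_type pos = label.find('}');
	if (pos == std::string::npos)
		return std::make_pair(label, label);  // e.g. <s>, </s> or epsilon
	return std::make_pair(label.substr(0, pos), label.substr(pos + 1));
}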

4. Conclusion – Future work

This article tried to provide some intuition on the compatibility issues between ngram and phonetisaurus models, as well as on possible solutions and workarounds. As explained in the previous section, there is already a straightforward procedure to overcome these issues (export to ARPA format and regenerate the binary model), but in any case the relabeling approach described in the same section is worth trying.

References

[1] J. Salatas, “Using OpenGrm NGram Library for the encoding of joint multigram language models as WFST”, ICT Research Blog, June 2012.

[2] J. Salatas, “Automating the creation of joint multigram language models as WFST”, ICT Research Blog, June 2012.

[3] J. Salatas, “Phonetisaurus: A WFST-driven Phoneticizer – Framework Review”, ICT Research Blog, June 2012.

Automating the creation of joint multigram language models as WFST

Notice: This article is outdated. The application described here is now part of the SphinxTrain application. Please refer to recent articles in CMUSphinx category for the latest info.

(originally posted at http://cmusphinx.sourceforge.net/2012/06/automating-the-creation-of-joint-multigram-language-models-as-wfst/)

Foreword

Previous articles have introduced the C++ code to align a pronunciation dictionary [1] and shown how this aligned dictionary can be used in combination with the OpenGrm NGram Library for the encoding of joint multigram language models as WFST [2]. This article will describe the automation of the language model creation procedures as a complete C++ application that is simpler to use than the original procedures described in [2].

1. Installation
The procedure below is tested on an Intel CPU running openSuSE 12.1 x64 with gcc 4.6.2. Further testing is required for other systems (MacOSX, Windows).

The code requires the openFST library to be installed on your system. Having downloaded, compiled and installed openFST, the first step is to checkout the code from the cmusphinx SVN repository:

$ svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/branches/g2p/train

and compile it

$ cd train
$ make
g++ -c -g -o src/train.o src/train.cpp
g++ -c -g -o src/phonetisaurus/M2MFstAligner.o src/phonetisaurus/M2MFstAligner.cpp
g++ -c -g -o src/phonetisaurus/FstPathFinder.o src/phonetisaurus/FstPathFinder.cpp
g++ -g -L/usr/local/lib64/fst -lfst -lfstfar -lfstfarscript -ldl -lngram -o train src/train.o src/phonetisaurus/M2MFstAligner.o src/phonetisaurus/FstPathFinder.o
$

2. Usage
Having compiled the application, running it without any command line arguments will print out its usage:
$ ./train
Input file not provided
Usage: ./train [--seq1_del] [--seq2_del] [--seq1_max SEQ1_MAX] [--seq2_max SEQ2_MAX]
[--seq1_sep SEQ1_SEP] [--seq2_sep SEQ2_SEP] [--s1s2_sep S1S2_SEP]
[--eps EPS] [--skip SKIP] [--seq1in_sep SEQ1IN_SEP] [--seq2in_sep SEQ2IN_SEP]
[--s1s2_delim S1S2_DELIM] [--iter ITER] [--order ORDER] [--smooth SMOOTH]
[--noalign] --ifile IFILE --ofile OFILE
--seq1_del, Allow deletions in sequence 1. Defaults to false.
--seq2_del, Allow deletions in sequence 2. Defaults to false.
--seq1_max SEQ1_MAX, Maximum subsequence length for sequence 1. Defaults to 2.
--seq2_max SEQ2_MAX, Maximum subsequence length for sequence 2. Defaults to 2.
--seq1_sep SEQ1_SEP, Separator token for sequence 1. Defaults to '|'.
--seq2_sep SEQ2_SEP, Separator token for sequence 2. Defaults to '|'.
--s1s2_sep S1S2_SEP, Separator token for seq1 and seq2 alignments. Defaults to '}'.
--eps EPS, Epsilon symbol. Defaults to ''.
--skip SKIP, Skip/null symbol. Defaults to '_'.
--seq1in_sep SEQ1IN_SEP, Separator for seq1 in the input training file. Defaults to ''.
--seq2in_sep SEQ2IN_SEP, Separator for seq2 in the input training file. Defaults to ' '.
--s1s2_delim S1S2_DELIM, Separator for seq1/seq2 in the input training file. Defaults to ' '.
--iter ITER, Maximum number of iterations for EM. Defaults to 10.
--ifile IFILE, File containing training sequences.
--ofile OFILE, Write the binary fst model to file.
--noalign, Do not align. Assume that the aligned corpus already exists.
Defaults to false.
--order ORDER, N-gram order. Defaults to 9.
--smooth SMOOTH, Smoothing method. Available options are:
"presmoothed", "unsmoothed", "kneser_ney", "absolute",
"katz", "witten_bell", "unsmoothed". Defaults to "kneser_ney".
$

As in [1], the two required options are the pronunciation dictionary (IFILE) and the file in which the binary fst model will be saved (OFILE). The application provides default values for all other options, and an fst binary model for cmudict (v. 0.7a) can be created simply by the following command

$ ./train --seq1_del --seq2_del --ifile <path to cmudict> --ofile <path to binary fst>

allowing for deletions in both graphemes and phonemes, and

$ ./train --ifile <path to cmudict> --ofile <path to binary fst>

not allowing for deletions.

3. Performance, Evaluation and Comparison with phonetisaurus
In order to test the new code's performance, tests similar to those in [1] and [2] were performed, with similar results both in resource utilization and in the ability to generate pronunciations for previously unseen words.

4. Conclusion – Future Work
Having integrated the model training procedure into a simplified application, combined with the dictionary alignment code, the next step would be to create the evaluation code in order to avoid using phonetisaurus' evaluate python script. Further steps include writing the necessary code to load the WFST binary model in java and to convert it to the java implementation of openFST [3], [4].

References
[1] J. Salatas, “Porting phonetisaurus many-to-many alignment python script to C++”, ICT Research Blog, May 2012.
[2] J. Salatas, “Using OpenGrm NGram Library for the encoding of joint multigram language models as WFST”, ICT Research Blog, June 2012.
[3] J. Salatas, “Porting openFST to java: Part 1”, ICT Research Blog, May 2012.
[4] J. Salatas, “Porting openFST to java: Part 2”, ICT Research Blog, May 2012.

Using OpenGrm NGram Library for the encoding of joint multigram language models as WFST

(originally posted at http://cmusphinx.sourceforge.net/2012/06/using-opengrm-ngram-library-for-the-encoding-of-joint-multigram-language-models-as-wfst/)

Foreword
This article will review the OpenGrm NGram Library [1] and its usage for language modeling in ASR. OpenGrm makes use of functionality in the openFST library [2] to create, access and manipulate n-gram language models, and it can be used as the language model training toolkit for integrating phonetisaurus' model training procedures [3] into a simplified application.

1. Model Training
Given the aligned corpus produced from cmudict by the aligner code of our previous article [4], the first step is to generate an OpenFst-style symbol table for the text tokens in the input corpus. This can be done with [5]:

# ngramsymbols < cmudict.corpus > cmudict.syms

Given the symbol table, a text corpus can be converted to a binary finite state archive (far) [6] with:

# farcompilestrings --symbols=cmudict.syms --keep_symbols=1 cmudict.corpus > cmudict.far

The next step is to count n-grams in the input corpus, now converted to FAR format; this produces an n-gram count file in FST format. The --order switch selects the maximum n-gram length to count.

The 1-gram through 9-gram counts for the cmudict.far finite-state archive created above can be computed with:

# ngramcount --order=9 cmudict.far > cmudict.cnts

Finally, the 9-gram counts in cmudict.cnts can be converted to a WFST model with:

# ngrammake --method="kneser_ney" cmudict.cnts > cmudict.mod

The --method option is used for selecting the smoothing method [7] from one of the six available:

  • witten_bell: smooths using Witten-Bell [8], with a hyperparameter k, as presented in [9].
  • absolute: smooths based on Absolute Discounting [10], using bins and discount parameters.
  • katz: smooths based on Katz Backoff [11], using bins parameters.
  • kneser_ney: smooths based on Kneser-Ney [12], a variant of Absolute Discounting.
  • presmoothed: normalizes at each state based on the n-gram count of the history.
  • unsmoothed: normalizes the model but provides no smoothing.

2. Evaluation – Comparison with phonetisaurus

In order to evaluate the OpenGrm models, the procedure described above was repeated using the standard 90%-10% split of cmudict into a training and a test set, respectively. The binary fst format produced by ngrammake wasn't readable by the phonetisaurus evaluation script, so it was converted to ARPA format with:

# ngramprint --ARPA cmudict.mod > cmudict.arpa

and then back to a phonetisaurus binary fst format with:

# phonetisaurus-arpa2fst --input=cmudict.arpa --prefix="cmudict/cmudict"

Finally, the test set was evaluated with:

# evaluate.py --modelfile cmudict/cmudict.fst --testfile cmudict.dict.test --prefix cmudict/cmudict
Words: 13328 Hyps: 13328 Refs: 13328
##############################################
EVALUATION RESULTS
---------------------------------------------------------------------
(T)otal tokens in reference: 84955
(M)atches: 77165 (S)ubstitutions: 7044 (I)nsertions: 654 (D)eletions: 746
% Correct (M/T) -- %90.83
% Token ER ((S+I+D)/T) -- %9.94
% Accuracy 1.0-ER -- %90.06
---------------------------------------------------------------------
(S)equences: 13328 (C)orrect sequences: 8010 (E)rror sequences: 5318
% Sequence ER (E/S) -- %39.90
% Sequence Acc (1.0-E/S) -- %60.10
##############################################

3. Conclusion – Future Work

The evaluation results above are, as expected, almost identical to those produced by phonetisaurus' own procedures, which use the MITLM toolkit instead of OpenGrm.

Given the above procedure, the next step is to integrate all of the commands above into a simplified application, combined with the dictionary alignment code introduced in our previous article [13].

References

[1] OpenGrm NGram Library

[2] openFST library

[3] J. Salatas, "Phonetisaurus: A WFST-driven Phoneticizer – Framework Review", ICT Research Blog, May 2012.

[4] J. Salatas, "Porting phonetisaurus many-to-many alignment python script to C++", ICT Research Blog, May 2012.

[5] OpenGrm NGram Library Quick Tour

[6] OpenFst Extensions: FST Archives (FARs)

[7] S.F. Chen, G. Goodman, “An Empirical Study of Smoothing Techniques for Language Modeling”, Harvard Computer Science Technical report TR-10-98, 1998.

[8] I. H. Witten, T. C. Bell, “The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression”, IEEE Transactions on Information Theory 37 (4), pp. 1085–1094, 1991.

[9] B. Carpenter, “Scaling high-order character language models to gigabytes”, Proceedings of the ACL Workshop on Software, pp. 86–99, 2005.

[10] H. Ney, U. Essen, R. Kneser, “On structuring probabilistic dependences in stochastic language modeling”, Computer Speech and Language 8, pp. 1–38, 1994.

[11] S. M. Katz, “Estimation of probabilities from sparse data for the language model component of a speech recogniser”, IEEE Transactions on Acoustics, Speech, and Signal Processing 35 (3), pp. 400–401, 1987.

[12] R. Kneser, H. Ney, “Improved backing-off for m-gram language modeling”, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). pp. 181–184, 1995.

[13] J. Salatas, "Porting phonetisaurus many-to-many alignment python script to C++", ICT Research Blog, May 2012.