input label is a float in G.fsa.txt
I follow the librispeech recipe of snowfall with our own private data. NOTE: The private data runs successfully on the Kaldi platform. When I use the commands below to prepare L and G, I find that G.fsa.txt contains an input label that is a float, which doesn't match the format expected by k2.
Prepare L:
local/prepare_lang.sh \
--position-dependent-phones false \
data/local/dict "<unk>" \
data/local/lang_tmp_nosp \
data/lang_nosp
Prepare G:
$ mkdir data/local/lm_3gram
$ sort data/lang_nosp/words.txt | awk '{print $1}' | grep -v '\#0' | grep -v '<eps>' | grep -v -F "<UNK>" > data/local/lm_3gram/vocab
$ cat train/text | cut -f2- -d' '> data/local/lm_3gram/train.txt
$ sed 's/<UNK>/<unk>/g' data/local/lm_3gram/train.txt | ngram-count -lm - -order 3 -text - -vocab data/local/lm_3gram/vocab -unk -sort -maxent -maxent-convert-to-arpa|sed 's/<unk>/<UNK>/g' > data/local/lm_3gram/3gram.arpa
# Build G
local/arpa2fst.py data/local/lm_3gram/3gram.arpa |
local/sym2int.pl -f 3 data/lang_nosp/words.txt >data/lang_nosp/G.fsa.txt
Some lines from G.fsa.txt:
3 1 34797 0.07946522794569635
1 3 3885 10.790068018971331
1 2.5243010115984426
4 1 34797 -0.0
1 4 17155 13.309964185286875
5 1 34797 0.3504506880515822
1 5 8.18207264441274
I can remove the erroneous lines in G.fsa.txt, but I am curious why this happens.
Does the tool arpa2fst from kaldi generate a correct G.fst using data/local/lm_3gram/3gram.arpa as input?
The incorrect line may be generated by https://github.com/k2-fsa/snowfall/blob/09512c3978cd0c614a7580e724767f3d509e2ab9/egs/librispeech/asr/simple_v1/local/arpa2fst.py#L111
with sub_eps = ''.
There are two spaces between 5 and 8 in
1 5  8.18207264441274
i.e. the label field is empty, because the backoff (epsilon) label was written out as an empty string.
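As a quick workaround (a minimal sketch, not part of snowfall; the script name and the choice of 0 as the epsilon id are my assumptions), such lines can be repaired instead of deleted, by giving the backoff arcs an explicit epsilon label. Whether epsilon (0) or the #0 disambiguation symbol is the right choice depends on how the rest of your graph-building pipeline treats backoff arcs.

# fix_g_fsa.py (hypothetical): repair arcs whose label field is empty.
# In the acceptor text format an arc line is "src dst label score" and a
# final-state line is "state [score]"; a 3-field line whose last field is
# not an integer is a backoff arc that lost its label, so we insert 0.
import sys

def is_int(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

for line in sys.stdin:
    fields = line.split()
    if len(fields) == 3 and not is_int(fields[2]):
        src, dst, score = fields
        print(src, dst, 0, score)   # "src dst score" -> "src dst 0 score"
    else:
        print(line.rstrip())

Usage: python3 fix_g_fsa.py < data/lang_nosp/G.fsa.txt > data/lang_nosp/G.fsa.fixed.txt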
Does the tool arpa2fst from kaldi generate a correct G.fst using data/local/lm_3gram/3gram.arpa as input?
It should be correct when generated with Kaldi's C++ arpa2fst command:
[md510@node06 maison2_d3_asr]$ arpa2fst --disambig-symbol=#0 --read-symbol-table=/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lang_nosp/words.txt /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/3gram.arpa /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst
arpa2fst --disambig-symbol=#0 --read-symbol-table=/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lang_nosp/words.txt /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/3gram.arpa /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst
LOG (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 12 [-3.553429 <UNK> -0.1521988] skipped: word '<UNK>' not in symbol table
LOG (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 34817 [-3.932194 <s> <UNK> -0.1743848] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39229 [-0.9002573 <UNK> </s>] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39230 [-1.613999 <UNK> <UNK> -0.4021042] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39231 [-2.296686 <UNK> <v-noise> -0.08156343] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39232 [-1.326035 <UNK> ah -0.1199462] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39233 [-3.21014 <UNK> aim -0.06807161] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39234 [-2.662222 <UNK> already -0.004396959] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39235 [-2.458678 <UNK> and -0.003280976] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39236 [-4.453877 <UNK> arvogra -0.02306214] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39237 [-2.105228 <UNK> birthday -0.05440236] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39238 [-3.764909 <UNK> chong -0.04804957] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39239 [-2.329228 <UNK> class -0.04189709] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39240 [-3.040592 <UNK> clear -0.03063362] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39241 [-4.055021 <UNK> coaster -0.01942559] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39242 [-3.952355 <UNK> compete -0.07786922] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39243 [-3.544479 <UNK> competitive -0.04858544] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39244 [-4.446651 <UNK> cube -0.07530819] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39245 [-2.945884 <UNK> day -0.1040725] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39246 [-3.145669 <UNK> econs -0.08970822] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39247 [-2.501991 <UNK> edwin -0.00977387] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39248 [-3.019146 <UNK> enough -0.03975383] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39249 [-3.398197 <UNK> european -0.06885289] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39250 [-2.681935 <UNK> from -0.05035353] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39251 [-3.201609 <UNK> function -0.1069543] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39252 [-2.460359 <UNK> go -0.01941609] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39253 [-2.723601 <UNK> he -0.09909758] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39254 [-2.331394 <UNK> i -0.006897578] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39255 [-3.155031 <UNK> ice 0] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39256 [-2.511648 <UNK> in -0.02830878] skipped: word '<UNK>' not in symbol table
LOG (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:149) Reading \3-grams: section.
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:259) Of 590 parse warnings, 30 were reported. Run program with --max_warnings=-1 to see all warnings
LOG (arpa2fst[5.5.644~1-0d24]:RemoveRedundantStates():arpa-lm-compiler.cc:359) Reduced num-states from 287426 to 268112
Yes, I agree with you. snowfall/egs/librispeech/asr/simple_v1/local/arpa2fst.py can't handle sub_eps = ''.
I am going to wrap Kaldi's arpa2fst in Python, like what Piotr has done for edit distance: https://github.com/pzelasko/kaldialign
Thanks a lot, @csukuangfj. Meanwhile, I find the result interesting.
Case 1: I use snowfall/egs/librispeech/asr/simple_v1/local/arpa2fst.py to get G.fsa.txt, remove the incorrect lines to obtain G.fsa.txt1, then decode the test set and get its WER.
Case 2: I use the Kaldi commands below to get G.fst.txt, then convert it to the FSA format expected by k2 (G.fsa.txt2) by removing the output labels.
arpa2fst --disambig-symbol=#0 --read-symbol-table=/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lang_nosp/words.txt /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/3gram.arpa /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst
fstprint /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst.txt
awk '{print $1,$2,$3,$5}' /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst.txt > /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fsa.txt2
cp -r /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fsa.txt2 data/lang_nosp/
Then I use this G.fsa.txt2 to build the decoding graph, and finally I get the WER of the test set.
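For context, in snowfall-style graph building such an acceptor text file is loaded into k2 roughly like this (a minimal sketch, assuming k2's Fsa.from_openfst with acceptor=True; the path is the one used above):

# load_g.py (hypothetical sketch)
import k2

with open('data/lang_nosp/G.fsa.txt2') as f:
    # acceptor=True: each arc line is "src dst label score", no output labels
    G = k2.Fsa.from_openfst(f.read(), acceptor=True)

print(G.shape)     # (num_states, None)
print(G.num_arcs)  # comparable to the wc -l counts below, minus final-state lines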
I found that
case1 : $ wc -l data/lang_nosp/G.fsa.txt1
1238901 data/lang_nosp/G.fsa.txt1
case2: $ wc -l data/lang_nosp/G.fsa.txt2
1219049 data/lang_nosp/G.fsa.txt2
BUT, in both cases, the WER of the test set is the same.
Kaldi's arpa2fst is not 100% simple, especially when you are dealing with a language model that was pruned. There are cases, with pruned LMs, where all the n-grams leaving an LM state are pruned away but the LM state is kept, because either there is an LM state that backs off to it, or there are higher-order LM states that transition to it (i.e. after consuming a word). In these cases it is possible to "bypass" transitions to the state by going directly to its backoff state. We made a change a year or two ago to do that.
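To illustrate the idea (a toy sketch of the bypass described above, not Kaldi's actual RemoveRedundantStates code; arcs are (src, dst, label, cost) tuples, label 0 marks a backoff arc, and costs are negative log-probs so they add):

# bypass_sketch.py (hypothetical): redirect arcs that enter a "dead" LM state
# (one whose only outgoing arc is its backoff arc) straight to the backoff
# state, folding in the backoff cost.
from collections import defaultdict

def bypass_dead_states(arcs):
    out = defaultdict(list)
    for a in arcs:
        out[a[0]].append(a)

    # dead state -> (backoff target, backoff cost)
    backoff = {s: (al[0][1], al[0][3]) for s, al in out.items()
               if len(al) == 1 and al[0][2] == 0}

    new_arcs = []
    for src, dst, label, cost in arcs:
        if src in backoff:
            continue  # the dead state's own backoff arc becomes unreachable
        while dst in backoff:  # follow chains of dead states
            b_dst, b_cost = backoff[dst]
            dst, cost = b_dst, cost + b_cost
        new_arcs.append((src, dst, label, cost))
    return new_arcs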
I just wrapped Kaldi's arpa2fst in Python.
You can use
pip install kaldilm
to install it.
Please refer to https://github.com/csukuangfj/kaldilm for its usage.
Also, there is a colab notebook for demonstration purposes: https://colab.research.google.com/drive/1rTGQiDDlhE8ezTH4kmR4m8vlvs6lnl6Z?usp=sharing
@shanguanma
Could you replace local/arpa2fst.py with kaldilm.arpa2fst and try again?
@danpovey, thank you for your explanation. @csukuangfj, yes, I will use the package to decode the test set. Once the results come out, I will report them here.
@csukuangfj, I follow your kaldilm README and prepare G.fsa.txt with the command below.
python3 local/kaldi_lm.py \
--arpa_lm data/local/lm_3gram/3gram.arpa \
--fst_lm data/local/lm_3gram/G.fsa \
--words_map data/local/lm_3gram/words.txt \
--fsa_lm data/local/lm_3gram/G.fsa.txt3
cp -r data/local/lm_3gram/G.fsa.txt3 data/lang_nosp/
The content of local/kaldi_lm.py is as follows:
import argparse

import kaldilm


def get_parse():
    parser = argparse.ArgumentParser(
        description="Generate G without incorrect lines such as "
        "'1 5  8.18207264441274', whose input label should be an integer, "
        "not a float"
    )
    parser.add_argument(
        "--arpa_lm", type=str, help="input ARPA-format LM file path"
    )
    parser.add_argument(
        "--fst_lm", type=str, help="output binary LM file in OpenFST format"
    )
    parser.add_argument(
        "--words_map", type=str, help="output file mapping each word to its id"
    )
    parser.add_argument(
        "--fsa_lm", type=str,
        help="output text-format LM with integer labels for k2"
    )
    args = parser.parse_args()
    return args


def main():
    args = get_parse()
    # Convert the ARPA LM; kaldilm writes the binary FST and the symbol table,
    # and returns the text form, which we save for k2.
    s = kaldilm.arpa2fst(args.arpa_lm, args.fst_lm, write_symbol_table=args.words_map)
    with open(args.fsa_lm, "w") as f:
        f.write(s)


if __name__ == "__main__":
    main()
BUT, when determinizing L*G during graph construction, it is very slow and does not finish after about 2 hours. NOTE: the private corpus is only about 100 hours, and this step completes very quickly in case 1 and case 2.
2020-12-30 08:20:50,019 INFO [decode_seame.py:145] Loading L_disambig.fst.txt
2020-12-30 08:20:50,301 INFO [decode_seame.py:148] Loading G.fsa.txt
2020-12-30 08:20:51,547 DEBUG [graph.py:32] Intersecting L and G
2020-12-30 08:20:54,139 DEBUG [graph.py:34] LG shape = (2241818, None)
2020-12-30 08:20:54,139 DEBUG [graph.py:35] Connecting L*G
2020-12-30 08:20:54,244 DEBUG [graph.py:37] LG shape = (2241818, None)
2020-12-30 08:20:54,244 DEBUG [graph.py:38] Determinizing L*G
@shanguanma Thanks for your feedback.
From your case 2:
arpa2fst --disambig-symbol=#0 \
--read-symbol-table=/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lang_nosp/words.txt \
/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/3gram.arpa \
/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst
fstprint /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst \
/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst.txt
awk '{print $1,$2,$3,$5}' \
/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst.txt \
> /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fsa.txt2
cp -r /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fsa.txt2 data/lang_nosp/
Could you try the following code, which I think should produce the same FST as case 2 (except that args.fsa_lm is an FSA, so you do not need to process it with awk)?
s = kaldilm.arpa2fst(
args.arpa_lm,
args.fst_lm,
read_symbol_table='/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lang_nosp/words.txt',
disambig_symbol='#0'
)
with open(args.fsa_lm, 'w') as f:
f.write(s)
when determinizing L*G during graph construction, it is very slow
The reason may be that G uses a different word symbol table from L, as you did not pass read_symbol_table to kaldilm.arpa2fst.
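One quick way to check this (a minimal sketch; the two paths come from the commands above, and the comparison logic is just an illustration) is to compare the symbol table kaldilm wrote via --words_map with the words.txt that L was built from:

# check_symbols.py (hypothetical sanity check)
def read_sym(path):
    table = {}
    with open(path) as f:
        for line in f:
            word, idx = line.split()
            table[word] = int(idx)
    return table

lang_words = read_sym('data/lang_nosp/words.txt')      # used to build L
lm_words = read_sym('data/local/lm_3gram/words.txt')   # written by kaldilm

mismatched = [w for w in lm_words if lang_words.get(w) != lm_words[w]]
print('words with a different (or missing) id:', len(mismatched))
print(mismatched[:10])

If the two tables disagree, the integer labels in G refer to different words than L expects, which can make the composed L*G much larger and slower to determinize.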
@csukuangfj, because the WERs of case 1 and case 2 on the test set were exactly the same, I double-checked and found that when decoding case 2 I was still using the LG.pt from case 1, which is not correct. So I re-decoded case 1, case 2, and case 3, and I find that the results of the three cases are almost the same. All results are as follows:
case 1:
(k2-fsa) ntu-dso@ntudso-X9DA7-E:~/w2020/k2-fsa/snowfall/egs/seame$ grep -rn "dev_sge" exp-lstm-adam/results_details1.txt
23708:cuts_dev_sge.json.gz: %WER 57.69% [31389 / 54408, 1550 ins, 9197 del, 20642 sub ]
(k2-fsa) ntu-dso@ntudso-X9DA7-E:~/w2020/k2-fsa/snowfall/egs/seame$ grep -rn "dev_man" exp-lstm-adam/results_details1.txt
13064:cuts_dev_man.json.gz: %WER 43.46% [42038 / 96738, 2946 ins, 10484 del, 28608 sub ]
case 2:
(k2-fsa) ntu-dso@ntudso-X9DA7-E:~/w2020/k2-fsa/snowfall/egs/seame$ grep -rn "dev_sge" exp-lstm-adam/results_details2.txt
23708:cuts_dev_sge.json.gz: %WER 57.99% [31552 / 54408, 1462 ins, 9794 del, 20296 sub ]
(k2-fsa) ntu-dso@ntudso-X9DA7-E:~/w2020/k2-fsa/snowfall/egs/seame$ grep -rn "dev_man" exp-lstm-adam/results_details2.txt
13064:cuts_dev_man.json.gz: %WER 43.53% [42111 / 96738, 2841 ins, 10889 del, 28381 sub ]
case 3:
(k2-fsa) ntu-dso@ntudso-X9DA7-E:~/w2020/k2-fsa/snowfall/egs/seame$ grep -rn "dev_sge" exp-lstm-adam/results_details4.txt
23708:cuts_dev_sge.json.gz: %WER 57.99% [31552 / 54408, 1462 ins, 9794 del, 20296 sub ]
(k2-fsa) ntu-dso@ntudso-X9DA7-E:~/w2020/k2-fsa/snowfall/egs/seame$ grep -rn "dev_man" exp-lstm-adam/results_details4.txt
13064:cuts_dev_man.json.gz: %WER 43.53% [42111 / 96738, 2841 ins, 10889 del, 28381 sub ]
Note: it should be MER% because the corpus is English-Chinese code-switching speech; Chinese is scored at the character level and English at the word level.
BUT, currently the results of k2 are worse than the results of Kaldi.
Kaldi results (WER%; again, this should really be MER%, for the reason above):
dev_sge: 25.71%
dev_man: 18.98%
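For reference, MER here just means the usual edit-distance scoring applied to a mixed token sequence; a minimal sketch of the tokenization (the CJK regex boundary is my assumption about how the text is split for scoring):

# mer_tokens.py (hypothetical): split English-Chinese code-switching text into
# MER scoring units: Chinese characters plus English words. The resulting
# token lists can then be fed to an ordinary WER computation.
import re

def mer_tokens(text):
    tokens = []
    for piece in text.split():
        # CJK characters become single tokens; non-CJK runs stay whole words.
        tokens.extend(re.findall(r'[\u4e00-\u9fff]|[^\u4e00-\u9fff]+', piece))
    return tokens

print(mer_tokens('i think 你 know 的 right'))
# -> ['i', 'think', '你', 'know', '的', 'right']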
Note: @csukuangfj, the code below
s = kaldilm.arpa2fst(
args.arpa_lm,
args.fst_lm,
read_symbol_table='/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lang_nosp/words.txt',
disambig_symbol='#0'
)
with open(args.fsa_lm, 'w') as f:
f.write(s)
outputs args.fsa_lm in OpenFST's FST format, not in k2's FSA format.
I find that the results of the three cases are almost the same.
The WERs for case 2 and case 3 should be identical, as they generate the same G.fst.
outputs args.fsa_lm in OpenFST's FST format, not in k2's FSA format.
Yes, you are right. kaldilm.arpa2fst should produce output identical to Kaldi's arpa2fst when they are given the same input.
Since Kaldi uses OpenFST, the output is in OpenFST format.
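If you need the acceptor form for k2 from that output, you can drop the output-label column, exactly like the awk '{print $1,$2,$3,$5}' command from case 2. A minimal Python sketch of the same conversion (assuming transducer arc lines "src dst ilabel olabel [weight]" and final-state lines "state [weight]"):

# fst_to_fsa.py (hypothetical): OpenFST transducer text -> acceptor text.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) == 5:        # arc with weight: keep src dst ilabel weight
        src, dst, ilabel, olabel, weight = fields
        print(src, dst, ilabel, weight)
    elif len(fields) == 4:      # arc without weight: keep src dst ilabel
        src, dst, ilabel, olabel = fields
        print(src, dst, ilabel)
    else:                       # final-state line: keep as is
        print(line.rstrip())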
Currently, the results of k2 are worse than the results of Kaldi.
The network in snowfall has not been tuned. But I think you now know how simple it is to train a CTC network with k2. I hope that once LF-MMI training is done, k2 can achieve WERs comparable to Kaldi's.
Yes, thanks a lot. Once the LF-MMI training code of k2 is done, I will follow it.