
Input label is a float in G.fsa.txt

shanguanma opened this issue 4 years ago • 14 comments

I followed the librispeech recipe of snowfall to build my own private data. NOTE: this private data runs successfully on the Kaldi platform. When I used the commands below to prepare L and G, I found that G.fsa.txt contains lines whose input label is a float, which does not match the format expected by k2.

Prepare L:
local/prepare_lang.sh \
     --position-dependent-phones false \
     data/local/dict "<unk>" \
     data/local/lang_tmp_nosp \
     data/lang_nosp
Prepare G:
$ mkdir  data/local/lm_3gram
$ sort data/lang_nosp/words.txt | awk '{print $1}' | grep -v '\#0' | grep -v '<eps>' | grep -v -F "<UNK>" >  data/local/lm_3gram/vocab
$ cat train/text | cut -f2- -d' '>  data/local/lm_3gram/train.txt
$ sed 's/<UNK>/<unk>/g'  data/local/lm_3gram/train.txt | ngram-count -lm - -order 3 -text - -vocab   data/local/lm_3gram/vocab -unk -sort -maxent -maxent-convert-to-arpa|sed 's/<unk>/<UNK>/g'  >  data/local/lm_3gram/3gram.arpa

# Build G
   local/arpa2fst.py data/local/lm_3gram/3gram.arpa |
          local/sym2int.pl -f 3 data/lang_nosp/words.txt >data/lang_nosp/G.fsa.txt

Some lines from G.fsa.txt:

3 1 34797 0.07946522794569635
1 3 3885 10.790068018971331
1 2.5243010115984426
4 1 34797 -0.0
1 4 17155 13.309964185286875
5 1 34797 0.3504506880515822
1 5  8.18207264441274

I can remove the erroneous lines from G.fsa.txt, but I am curious why this happens.
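
Concretely, the filtering I mean is something like this sketch (my own, not part of the recipe; it assumes valid k2 acceptor lines are 4-field arcs with an integer label, or a single-field final-state line):

# Drop arc lines whose label field is missing or non-integer.
with open('data/lang_nosp/G.fsa.txt') as fin, \
        open('data/lang_nosp/G.fsa.txt1', 'w') as fout:
    for line in fin:
        fields = line.split()
        # Valid lines: 'src dst label weight' arcs with an integer label,
        # or a single-field final-state line.
        if len(fields) == 1 or (len(fields) == 4
                                and fields[2].lstrip('-').isdigit()):
            fout.write(line)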

shanguanma avatar Dec 28 '20 14:12 shanguanma

Does the tool arpa2fst from kaldi generate a correct G.fst using data/local/lm_3gram/3gram.arpa as input?

csukuangfj avatar Dec 28 '20 14:12 csukuangfj

The incorrect line may be generated by https://github.com/k2-fsa/snowfall/blob/09512c3978cd0c614a7580e724767f3d509e2ab9/egs/librispeech/asr/simple_v1/local/arpa2fst.py#L111

with sub_eps = ''.

There are two spaces between 5 and 8 in

1 5  8.18207264441274
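
For illustration, here is a minimal sketch (not the actual snowfall code) of how an empty sub_eps can produce such a line: if epsilon labels are replaced by sub_eps when printing arcs, the label field vanishes entirely, and k2 then reads the weight as the input label.

def format_arc(src, dst, word, weight, sub_eps=''):
    # Replace epsilon with sub_eps before printing the arc.
    label = sub_eps if word == '<eps>' else word
    return f'{src} {dst} {label} {weight}'

print(format_arc(1, 5, '<eps>', 8.18207264441274))       # '1 5  8.182...' - broken
print(format_arc(1, 5, '<eps>', 8.18207264441274, '0'))  # '1 5 0 8.182...' - ok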

csukuangfj avatar Dec 28 '20 14:12 csukuangfj

Does the tool arpa2fst from kaldi generate a correct G.fst using data/local/lm_3gram/3gram.arpa as input?

It should be correct via Kaldi's C++ arpa2fst command:

[md510@node06 maison2_d3_asr]$ arpa2fst  --disambig-symbol=#0 --read-symbol-table=/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lang_nosp/words.txt  /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/3gram.arpa /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst
arpa2fst --disambig-symbol=#0 --read-symbol-table=/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lang_nosp/words.txt /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/3gram.arpa /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst 
LOG (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:94) Reading \data\ section.
LOG (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:149) Reading \1-grams: section.
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 12 [-3.553429	<UNK>	-0.1521988] skipped: word '<UNK>' not in symbol table
LOG (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:149) Reading \2-grams: section.
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 34817 [-3.932194	<s> <UNK>	-0.1743848] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39229 [-0.9002573	<UNK> </s>] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39230 [-1.613999	<UNK> <UNK>	-0.4021042] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39231 [-2.296686	<UNK> <v-noise>	-0.08156343] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39232 [-1.326035	<UNK> ah	-0.1199462] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39233 [-3.21014	<UNK> aim	-0.06807161] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39234 [-2.662222	<UNK> already	-0.004396959] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39235 [-2.458678	<UNK> and	-0.003280976] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39236 [-4.453877	<UNK> arvogra	-0.02306214] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39237 [-2.105228	<UNK> birthday	-0.05440236] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39238 [-3.764909	<UNK> chong	-0.04804957] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39239 [-2.329228	<UNK> class	-0.04189709] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39240 [-3.040592	<UNK> clear	-0.03063362] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39241 [-4.055021	<UNK> coaster	-0.01942559] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39242 [-3.952355	<UNK> compete	-0.07786922] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39243 [-3.544479	<UNK> competitive	-0.04858544] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39244 [-4.446651	<UNK> cube	-0.07530819] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39245 [-2.945884	<UNK> day	-0.1040725] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39246 [-3.145669	<UNK> econs	-0.08970822] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39247 [-2.501991	<UNK> edwin	-0.00977387] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39248 [-3.019146	<UNK> enough	-0.03975383] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39249 [-3.398197	<UNK> european	-0.06885289] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39250 [-2.681935	<UNK> from	-0.05035353] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39251 [-3.201609	<UNK> function	-0.1069543] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39252 [-2.460359	<UNK> go	-0.01941609] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39253 [-2.723601	<UNK> he	-0.09909758] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39254 [-2.331394	<UNK> i	-0.006897578] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39255 [-3.155031	<UNK> ice	0] skipped: word '<UNK>' not in symbol table
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:219) line 39256 [-2.511648	<UNK> in	-0.02830878] skipped: word '<UNK>' not in symbol table
LOG (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:149) Reading \3-grams: section.
WARNING (arpa2fst[5.5.644~1-0d24]:Read():arpa-file-parser.cc:259) Of 590 parse warnings, 30 were reported. Run program with --max_warnings=-1 to see all warnings
LOG (arpa2fst[5.5.644~1-0d24]:RemoveRedundantStates():arpa-lm-compiler.cc:359) Reduced num-states from 287426 to 268112

Yes, I agree with you. The snowfall/egs/librispeech/asr/simple_v1/local/arpa2fst.py can't handle sub_eps = ''.

shanguanma avatar Dec 28 '20 14:12 shanguanma

I am going to wrap kaldi's arpa2fst in Python, like what Piotr has done for edit distance: https://github.com/pzelasko/kaldialign

csukuangfj avatar Dec 29 '20 01:12 csukuangfj

Thanks a lot, @csukuangfj. Meanwhile, I find the result interesting.

Case 1: I use snowfall/egs/librispeech/asr/simple_v1/local/arpa2fst.py to get G.fsa.txt, remove the incorrect lines, and save the result as G.fsa.txt1; then I decode the test set and get its WER.

Case 2: I use the Kaldi commands below to get G.fst.txt, then convert it to the k2 format (G.fsa.txt2) by removing the output labels.

arpa2fst  --disambig-symbol=#0 --read-symbol-table=/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lang_nosp/words.txt  /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/3gram.arpa /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst

fstprint  /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst  /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst.txt

awk '{print $1,$2,$3,$5}' /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst.txt > /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fsa.txt2
cp -r /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fsa.txt2  data/lang_nosp/

Then I use /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fsa.txt2 to build the decoding graph and finally get the WER of the test set. I found that:

case 1: $ wc -l data/lang_nosp/G.fsa.txt1
1238901 data/lang_nosp/G.fsa.txt1
case 2: $ wc -l data/lang_nosp/G.fsa.txt2
1219049 data/lang_nosp/G.fsa.txt2

BUT, in both cases the WER of the test set is the same.

shanguanma avatar Dec 29 '20 03:12 shanguanma

Kaldi's arpa2fst is not 100% simple, especially in cases where you are dealing with a language model that was pruned. There are cases, with pruned LMs, where all the n-grams leaving an LM state are pruned away but the LM state is kept because either there is an LM state that backs off to it, or there are higher-order LM states that transition to it (i.e. after consuming a word). In these cases it's possible to "bypass" transitions to the state by going directly to its backoff state. We made a change a year or two ago to do that.
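
In a toy sketch (my own illustration, not kaldi's actual code), the bypass looks like this: any state whose only remaining arc is its backoff arc can be skipped by redirecting incoming arcs straight to the backoff state and folding in the backoff cost.

# Toy LM as a dict: state -> list of (word, cost, next_state);
# word None marks the backoff arc.
arcs = {
    'hi':        [(None, 0.5, '<unigram>')],   # all n-gram arcs pruned away
    '<s>':       [('hi', 2.5, 'hi')],
    '<unigram>': [('there', 1.1, 'there')],
}

def bypass_dead_states(arcs):
    out = {}
    for state, alist in arcs.items():
        new = []
        for word, cost, dst in alist:
            # Follow chains of states that only have a backoff arc left.
            while len(arcs.get(dst, [])) == 1 and arcs[dst][0][0] is None:
                _, backoff_cost, backoff_dst = arcs[dst][0]
                cost += backoff_cost
                dst = backoff_dst
            new.append((word, cost, dst))
        out[state] = new
    return out

print(bypass_dead_states(arcs)['<s>'])  # [('hi', 3.0, '<unigram>')]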


danpovey avatar Dec 29 '20 04:12 danpovey

I just wrapped kaldi's arpa2fst in Python.

You can use

pip install kaldilm

to install it.

Please refer to https://github.com/csukuangfj/kaldilm for its usage.

Also, there is a colab notebook for demonstration purposes: https://colab.research.google.com/drive/1rTGQiDDlhE8ezTH4kmR4m8vlvs6lnl6Z?usp=sharing

@shanguanma Could you replace local/arpa2fst.py with kaldilm.arpa2fst and try again?

csukuangfj avatar Dec 29 '20 12:12 csukuangfj

@danpovey, thank you for your explanation. @csukuangfj, yes, I will use the package to decode the test set. Once the results come out, I will report them here.

shanguanma avatar Dec 29 '20 13:12 shanguanma

@csukuangfj, I followed your kaldilm README and prepared G.fsa.txt with the command below.

python3 local/kaldi_lm.py  \
            --arpa_lm data/local/lm_3gram/3gram.arpa \
            --fst_lm data/local/lm_3gram/G.fsa \
            --words_map data/local/lm_3gram/words.txt \
            --fsa_lm data/local/lm_3gram/G.fsa.txt3

cp -r data/local/lm_3gram/G.fsa.txt3 data/lang_nosp/

The content of local/kaldi_lm.py is as follows:

import argparse

import kaldilm


def get_args():
    parser = argparse.ArgumentParser(
        description="Convert an ARPA LM to the text FSA format of k2, "
                    "avoiding incorrect lines such as '1 5  8.18207264441274', "
                    "whose input label should be an integer, not a float."
    )
    parser.add_argument(
        "--arpa_lm", type=str, help="input ARPA-format LM file path"
    )
    parser.add_argument(
        "--fst_lm", type=str, help="output binary LM file path in OpenFST format"
    )
    parser.add_argument(
        "--words_map", type=str, help="output word-to-id mapping file path"
    )
    parser.add_argument(
        "--fsa_lm", type=str,
        help="output text-format LM with integer labels in k2 format"
    )
    return parser.parse_args()


def main():
    args = get_args()
    s = kaldilm.arpa2fst(args.arpa_lm, args.fst_lm,
                         write_symbol_table=args.words_map)
    with open(args.fsa_lm, "w") as f:
        f.write(s)


if __name__ == "__main__":
    main()

BUT, when constructing the decoding graph, determinizing L*G is very slow: it has not finished after about 2 hours. NOTE: the private corpus is only about 100 hours, and this step completes very quickly in case 1 and case 2.

2020-12-30 08:20:50,019 INFO [decode_seame.py:145] Loading L_disambig.fst.txt
2020-12-30 08:20:50,301 INFO [decode_seame.py:148] Loading G.fsa.txt
2020-12-30 08:20:51,547 DEBUG [graph.py:32] Intersecting L and G
2020-12-30 08:20:54,139 DEBUG [graph.py:34] LG shape = (2241818, None)
2020-12-30 08:20:54,139 DEBUG [graph.py:35] Connecting L*G
2020-12-30 08:20:54,244 DEBUG [graph.py:37] LG shape = (2241818, None)
2020-12-30 08:20:54,244 DEBUG [graph.py:38] Determinizing L*G

shanguanma avatar Dec 30 '20 02:12 shanguanma

@shanguanma Thanks for your feedback.

From your case 2:

arpa2fst  --disambig-symbol=#0 \
--read-symbol-table=/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lang_nosp/words.txt  \
/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/3gram.arpa \
/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst

fstprint  /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst  \
/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst.txt

awk '{print $1,$2,$3,$5}' \
/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fst.txt \
> /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fsa.txt2

cp -r /home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lm_3gram/G.fsa.txt2  data/lang_nosp/

Could you try the following code, which I think should produce the same FST as case 2 (except that args.fsa_lm is an FSA, so you do not need to process it with awk)?

s = kaldilm.arpa2fst(
    args.arpa_lm,
    args.fst_lm,
    read_symbol_table='/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lang_nosp/words.txt',
    disambig_symbol='#0',
)
with open(args.fsa_lm, 'w') as f:
    f.write(s)

csukuangfj avatar Dec 30 '20 03:12 csukuangfj

when constructing the decoding graph, determinizing L*G is very slow

The reason may be that G uses a different word symbol table from L, because you did not pass read_symbol_table to kaldilm.arpa2fst.
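
One quick way to check this (a sketch of mine, with hypothetical paths, not part of snowfall) is to verify that every label in the generated G is an id from the words.txt that L was built with:

def load_ids(words_txt):
    # words.txt lines look like: '<word> <id>'
    with open(words_txt) as f:
        return {int(line.split()[1]) for line in f if line.strip()}

ids = load_ids('data/lang_nosp/words.txt')
with open('data/lang_nosp/G.fsa.txt3') as f:
    bad = [line for line in f
           if len(line.split()) == 4 and int(line.split()[2]) not in ids]
print(f'{len(bad)} arcs use labels not in words.txt')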

csukuangfj avatar Dec 30 '20 03:12 csukuangfj

@csukuangfj, because the WERs of case 1 and case 2 were exactly the same, I double-checked and found that when decoding case 2 I was still using the LG.pt from case 1, which is not correct. So I re-decoded case 1, case 2, and case 3. I find that the results of case 1 and case 2 are almost the same. All results are as follows:

case 1:
(k2-fsa) ntu-dso@ntudso-X9DA7-E:~/w2020/k2-fsa/snowfall/egs/seame$ grep -rn "dev_sge" exp-lstm-adam/results_details1.txt
23708:cuts_dev_sge.json.gz: %WER 57.69% [31389 / 54408, 1550 ins, 9197 del, 20642 sub ]
(k2-fsa) ntu-dso@ntudso-X9DA7-E:~/w2020/k2-fsa/snowfall/egs/seame$ grep -rn "dev_man" exp-lstm-adam/results_details1.txt
13064:cuts_dev_man.json.gz: %WER 43.46% [42038 / 96738, 2946 ins, 10484 del, 28608 sub ]

case 2:
(k2-fsa) ntu-dso@ntudso-X9DA7-E:~/w2020/k2-fsa/snowfall/egs/seame$ grep -rn "dev_sge" exp-lstm-adam/results_details2.txt
23708:cuts_dev_sge.json.gz: %WER 57.99% [31552 / 54408, 1462 ins, 9794 del, 20296 sub ]
(k2-fsa) ntu-dso@ntudso-X9DA7-E:~/w2020/k2-fsa/snowfall/egs/seame$ grep -rn "dev_man" exp-lstm-adam/results_details2.txt
13064:cuts_dev_man.json.gz: %WER 43.53% [42111 / 96738, 2841 ins, 10889 del, 28381 sub ]


case 3:
(k2-fsa) ntu-dso@ntudso-X9DA7-E:~/w2020/k2-fsa/snowfall/egs/seame$ grep -rn "dev_sge" exp-lstm-adam/results_details4.txt
23708:cuts_dev_sge.json.gz: %WER 57.99% [31552 / 54408, 1462 ins, 9794 del, 20296 sub ]
(k2-fsa) ntu-dso@ntudso-X9DA7-E:~/w2020/k2-fsa/snowfall/egs/seame$ grep -rn "dev_man" exp-lstm-adam/results_details4.txt
13064:cuts_dev_man.json.gz: %WER 43.53% [42111 / 96738, 2841 ins, 10889 del, 28381 sub ]

Note: it should be MER% (mixed error rate) because the corpus is English-Chinese code-switched; Chinese is scored at the character level and English at the word level.
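
For concreteness, the mixed scoring unit can be produced by a tokenizer like this sketch (my own illustration, not the actual scoring code): each CJK character becomes one token, while English stays at the word level.

import re

def mixed_tokens(text):
    # One token per CJK character; runs of ASCII letters stay whole words.
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text)

print(mixed_tokens('我今天 learn snowfall'))
# ['我', '今', '天', 'learn', 'snowfall']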

BUT, currently the results of k2 are worse than the results of Kaldi.

Kaldi results (labeled WER%, but it should be MER%, because the corpus is English-Chinese code-switched; Chinese is a character unit, English a word unit):
dev_sge: 25.71%
dev_man: 18.98%

Note: @csukuangfj, the code below

s = kaldilm.arpa2fst(
    args.arpa_lm,
    args.fst_lm,
    read_symbol_table='/home4/md510/w2018/data/seame_3/kaldi_data_nopipe/lang_nosp/words.txt',
    disambig_symbol='#0',
)
with open(args.fsa_lm, 'w') as f:
    f.write(s)

outputs args.fsa_lm in OpenFST's FST format, not in the FSA format of k2.
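
So the olabel column still has to be dropped to get a k2 acceptor, as the awk one-liner in case 2 does. A Python equivalent (my own sketch, with hypothetical paths) would be:

def fst_to_fsa(fst_txt, fsa_txt):
    with open(fst_txt) as fin, open(fsa_txt, 'w') as fout:
        for line in fin:
            fields = line.split()
            # OpenFST arcs: src dst ilabel olabel [weight];
            # drop the olabel, keep final-state lines untouched.
            if len(fields) >= 4:
                del fields[3]
            fout.write(' '.join(fields) + '\n')

fst_to_fsa('data/local/lm_3gram/G.fsa.txt3', 'data/lang_nosp/G.fsa.txt3')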

shanguanma avatar Dec 30 '20 04:12 shanguanma

I find that the results of case 1 and case 2 are almost the same.

The WERs for case 2 and case 3 should be identical, as they generate the same G.fst.

outputs args.fsa_lm in OpenFST's FST format, not in the FSA format of k2.

Yes, you are right. kaldilm.arpa2fst should produce output identical to kaldi's arpa2fst when given the same input.

Since kaldi uses OpenFST, the output is in OpenFST format.


Currently, the results of k2 are worse than the results of Kaldi

The network in snowfall has not been tuned yet, but I think you can see now how simple it is to train a CTC network with k2. Hopefully, once LF-MMI training is done, k2 can achieve WERs comparable to kaldi's.

csukuangfj avatar Dec 30 '20 05:12 csukuangfj

Yes, thanks a lot. Once the LF-MMI training code of k2 is done, I will try it.

shanguanma avatar Dec 30 '20 05:12 shanguanma