
WIP: Add BPE training with LF-MMI.

Open · csukuangfj opened this pull request 4 years ago · 2 comments

A small vocab_size, e.g., 200, is used to avoid OOM while the bigram P is used. After removing P, a larger vocab size, e.g., 5000, becomes feasible.
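For context, a minimal sketch of how the vocabulary size might be set when training the BPE model with the sentencepiece Python API; the input file name `transcript_words.txt` and the exact training options are assumptions, not part of this PR:

```python
# Sketch only: train a BPE model whose size is controlled by vocab_size.
# "transcript_words.txt" (one transcript per line) is a hypothetical input file.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="transcript_words.txt",
    model_prefix="bpe_200",
    model_type="bpe",
    vocab_size=200,  # small, to avoid OOM while the bigram P is used
)

# After P is removed, a larger vocabulary, e.g. 5000, should become feasible:
# spm.SentencePieceTrainer.train(input="transcript_words.txt",
#                                model_prefix="bpe_5000",
#                                model_type="bpe",
#                                vocab_size=5000)
```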

@glynpu is doing BPE CTC training. We can use his implementation once it's ready. This pull request is for experimental purposes.


Will add decoding code later.

--

The training is still ongoing. The TensorBoard training log is available at https://tensorboard.dev/experiment/CN5yTQNmTLODdyLZA6K8rQ/#scalars&runSelectionState=eyIuIjp0cnVlfQ%3D%3D

csukuangfj avatar Jun 19 '21 12:06 csukuangfj

BTW, the way I think we can solve the memory-blowup issue is:

(i) use the new, more compact CTC topo;

(ii) train a bigram ARPA LM to make a compact LM, e.g. with kaldi's make_kn_lm.py; load it into k2 as P (no disambig symbols!), and remove epsilons. k2 uses a rm-epsilon algorithm that should keep the epsilon-free LM compact, unlike OpenFst, which would cause it to blow up.

BTW, I am asking some others to add a pruning option to make_kn_lm.py.
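A rough sketch of step (ii) on the k2 side, assuming the ARPA file has already been converted to an OpenFst-style text format (e.g. with the kaldilm package); the file name `P.fst.txt` is a placeholder:

```python
# Sketch only: load a bigram LM (already converted from ARPA to OpenFst text
# format, here assumed to be "P.fst.txt") into k2 as P and make it epsilon-free.
import k2

with open("P.fst.txt") as f:
    P = k2.Fsa.from_openfst(f.read(), acceptor=True)

# No disambiguation symbols are used, so epsilon removal is enough;
# k2's algorithm is expected to keep the result compact.
P = k2.remove_epsilon(P)
P = k2.arc_sort(P)
```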

danpovey avatar Jun 20 '21 02:06 danpovey

> (i) use the new, more compact CTC topo

Yes, I am using the new CTC topo.
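For reference, a minimal sketch of building the compact CTC topology in k2, assuming the `modified=True` variant of `k2.ctc_topo` is the "new" topology referred to here:

```python
# Sketch only: build the compact (modified) CTC topology for a BPE vocabulary.
import k2

vocab_size = 200  # matches the small vocab used in this PR
max_token_id = vocab_size - 1
ctc_topo = k2.ctc_topo(max_token_id, modified=True)
```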

> (ii) train a bigram ARPA LM to make a compact LM, e.g. with kaldi's make_kn_lm.py;

Will update the code to train a word-piece bigram ARPA LM with make_kn_lm.py.
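A possible sketch of that step: encode the transcripts into word pieces with the trained BPE model and run make_kn_lm.py on the result. The file names and the exact make_kn_lm.py options (`-ngram-order`, `-text`, `-lm`) are assumptions and may differ from what lands in this PR:

```python
# Sketch only: produce a word-piece level text and train a bigram ARPA LM on it.
import subprocess
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("bpe_200.model")  # hypothetical BPE model from an earlier step

with open("transcript_words.txt") as fin, open("transcript_tokens.txt", "w") as fout:
    for line in fin:
        pieces = sp.encode_as_pieces(line.strip())
        fout.write(" ".join(pieces) + "\n")

# Kneser-Ney bigram over word pieces; option names are assumed.
subprocess.run(
    [
        "python3", "make_kn_lm.py",
        "-ngram-order", "2",
        "-text", "transcript_tokens.txt",
        "-lm", "P.arpa",
    ],
    check=True,
)
```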

csukuangfj avatar Jun 20 '21 02:06 csukuangfj