Wav2Vec2_PyCTCDecode Error when running eval script

I followed the tutorial and installed kenlm inside the subfolder kenlm and created the polish.arpa file and updated it.

I get the following error when running ./eval.py --language polish --path_to_ngram polish.arpa. I'm running Python 3.7 with conda on an Ubuntu machine with a GPU. I manually installed pyctcdecode since it wasn't included in the requirements file.

./eval.py --language polish --path_to_ngram polish.arpa
Traceback (most recent call last):
  File "./eval.py", line 8, in <module>
    from pyctcdecode import build_ctcdecoder
  File "/home/bmoell/miniconda3/envs/wav2vec-nlp/lib/python3.7/site-packages/pyctcdecode/__init__.py", line 3, in <module>
    from .decoder import BeamSearchDecoderCTC, build_ctcdecoder  # noqa
  File "/home/bmoell/miniconda3/envs/wav2vec-nlp/lib/python3.7/site-packages/pyctcdecode/decoder.py", line 26, in <module>
    from .language_model import (
  File "/home/bmoell/miniconda3/envs/wav2vec-nlp/lib/python3.7/site-packages/pyctcdecode/language_model.py", line 55, in <module>
    def _prepare_unigram_set(unigrams: Collection[str], kenlm_model: kenlm.Model) -> Set[str]:
AttributeError: module 'kenlm' has no attribute 'Model'
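
One way to see which kenlm the interpreter is actually importing is sketched below. This is only an illustration: the helper name describe_module is made up, and it assumes Python 3.7 or newer. If a plain folder named kenlm (without an __init__.py) is picked up instead of the compiled package, Python may treat it as a namespace package, which would have no Model attribute.

```python
import importlib
import importlib.util


def describe_module(name, attribute):
    """Return (location, has_attribute) for an importable top-level module.

    location is the file the module is loaded from. For a namespace package
    (a plain folder without __init__.py) it is the list of search paths.
    Returns (None, False) when the module cannot be found at all.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None, False
    module = importlib.import_module(name)
    # spec.origin is empty for namespace packages on Python 3.7+,
    # so fall back to the folders that make up the package.
    location = spec.origin or list(spec.submodule_search_locations or [])
    return location, hasattr(module, attribute)
```

Calling describe_module("kenlm", "Model") in the same environment and working directory as eval.py shows where the import resolves and whether Model is available.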

Nov 04 '21 19:11 BirgerMoell

Ah I think I forgot to write that you also need to install this library here: https://github.com/kpu/kenlm#installation

Can you give the pip install command a try and see whether it works? :-)

Nov 04 '21 19:11 patrickvonplaten

In case it works it would be amazing if you could make a quick PR to update the requirements.txt and the README.md :-)

Nov 04 '21 19:11 patrickvonplaten

I installed it using pip install https://github.com/kpu/kenlm/archive/master.zip, and now I'm getting a new error.

My guess is that my polish.arpa file is malformed somehow, but it's quite tricky to check since the file is very slow to load and edit.

Here is how the beginning of the file looks. Since the error says it's expecting a tab, I suspect there might be spaces somewhere in the file where there should be tabs.

\data\
ngram 1=86587
ngram 2=546387
ngram 3=796581
ngram 4=843999
ngram 5=850874

\1-grams:
-5.7532206      <unk>   0
0       <s>     -0.06677356
0       </s>     -0.06677356

Traceback (most recent call last):
  File "./eval.py", line 100, in <module>
    main(args)
  File "./eval.py", line 37, in main
    args.path_to_ngram,
  File "/home/bmoell/miniconda3/envs/wav2vec-nlp/lib/python3.7/site-packages/pyctcdecode/decoder.py", line 697, in build_ctcdecoder
    kenlm_model = None if kenlm_model_path is None else kenlm.Model(kenlm_model_path)
  File "kenlm.pyx", line 142, in kenlm.Model.__init__
OSError: Cannot read model 'polish.arpa' (lm/read_arpa.hh:51 in void lm::Read1Gram(util::FilePiece&, Voc&, Weights*, lm::PositiveProbWarn&) [with Voc = lm::ngram::ProbingVocabulary; Weights = lm::ProbBackoff] threw FormatLoadException because `f.get() != '\t''. Expected tab after probability in the 1-gram at byte 103 Byte: 103)
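
Since the error reports a byte offset, the raw bytes around that position can be inspected without opening the whole file in an editor. The following is a minimal sketch; the helper name show_bytes_around and the context size are assumptions made for illustration.

```python
def show_bytes_around(path, offset, context=20):
    """Return the raw bytes surrounding `offset` in the file at `path`.

    Only a small window is read, so this stays fast on large .arpa files.
    """
    start = max(0, offset - context)
    with open(path, "rb") as f:
        f.seek(start)
        # Read from `start` up to `context` bytes past `offset`.
        return f.read(offset - start + context)
```

Printing repr(show_bytes_around("polish.arpa", 103)) would show tabs as \t and spaces as plain spaces, which makes it easier to tell them apart.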

Nov 04 '21 21:11 BirgerMoell

Yeah, this looks like an issue with the .arpa file. A good debugging strategy would be to:

take less text to create the n-gram -> instead of a 5-gram just build a 3-gram -> apply the same trick with </s> -> quicker debugging cycle -> find the bug -> correct it -> apply the same fix to the large 5-gram.

Don't think this is related to the code here

Nov 04 '21 21:11 patrickvonplaten

@patrickvonplaten I believe the FormatLoadException occurs when the file is edited (while following the instructions) in an editor (vim, nano, even IDEs). These commonly mishandle '\t', '\n' and similar whitespace characters.

The easiest way to apply your instructions without breaking the formatting would be:

Read the arpa file via Python and find the 2 lines that need to be changed.
Write a new arpa file, changing those lines while handling the formatting correctly, as the script below shows.
Load it and have fun.

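# Copy lang.arpa to lang_fixed.arpa, adding a </s> entry to the 1-grams.
# The count and the <s> line below are examples; replace them with the
# values found in your own file.
with open('lang.arpa', 'r', encoding='utf-8') as original, \
        open('lang_fixed.arpa', 'w', encoding='utf-8') as fixed:
    for line in original:
        if line == 'ngram 1=634704\n':
            # One more unigram, because </s> is added below.
            fixed.write('ngram 1=634705\n')
        elif line == '0\t<s>\t-0.07692495\n':
            # Keep <s> and add </s> with the same backoff weight.
            fixed.write(line)
            fixed.write('0\t</s>\t-0.07692495\n')
        else:
            fixed.write(line)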
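
Before loading the rewritten file, it may help to check that every n-gram line still has a tab after the probability. The sketch below is an illustration only: the function name find_lines_missing_tab is made up, and it assumes the layout used in the script above (log-probability, tab, words, then optionally a tab and a backoff weight).

```python
def find_lines_missing_tab(path, limit=10):
    """Return up to `limit` (line_number, line) pairs from the n-gram
    sections whose first tab-separated field is not a number."""
    bad = []
    in_ngrams = False
    with open(path, "r", encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if line.startswith("\\"):
                # Section markers such as \data\, \1-grams: and \end\
                in_ngrams = line.endswith("-grams:")
                continue
            if not in_ngrams or not line:
                continue
            first, sep, _ = line.partition("\t")
            try:
                float(first)
            except ValueError:
                # Spaces instead of a tab leave extra text in the first field.
                sep = ""
            if not sep:
                bad.append((number, line))
                if len(bad) >= limit:
                    break
    return bad
```

An empty result from find_lines_missing_tab("lang_fixed.arpa") suggests the tabs survived; any returned pairs point at the lines to fix.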

Dec 13 '21 11:12 deepconsc
