Wav2Vec2_PyCTCDecode
Error when running eval script
I followed the tutorial, installed kenlm inside the subfolder kenlm, created the polish.arpa file, and updated it.
I get the following error when running ./eval.py --language polish --path_to_ngram polish.arpa. I'm running Python 3.7 using conda on an Ubuntu machine with a GPU. I manually installed pyctcdecode since it wasn't included in the requirements file.
./eval.py --language polish --path_to_ngram polish.arpa
Traceback (most recent call last):
  File "./eval.py", line 8, in <module>
    from pyctcdecode import build_ctcdecoder
  File "/home/bmoell/miniconda3/envs/wav2vec-nlp/lib/python3.7/site-packages/pyctcdecode/__init__.py", line 3, in <module>
    from .decoder import BeamSearchDecoderCTC, build_ctcdecoder  # noqa
  File "/home/bmoell/miniconda3/envs/wav2vec-nlp/lib/python3.7/site-packages/pyctcdecode/decoder.py", line 26, in <module>
    from .language_model import (
  File "/home/bmoell/miniconda3/envs/wav2vec-nlp/lib/python3.7/site-packages/pyctcdecode/language_model.py", line 55, in <module>
    def _prepare_unigram_set(unigrams: Collection[str], kenlm_model: kenlm.Model) -> Set[str]:
AttributeError: module 'kenlm' has no attribute 'Model'
Ah I think I forgot to write that you also need to install this library here: https://github.com/kpu/kenlm#installation
Can you give the pip install command a try and see whether it works? :-)
In case it works, it would be amazing if you could make a quick PR to update the requirements.txt and the README.md :-)
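A quick way to see which `kenlm` Python actually imported is sketched below. One possible cause of the AttributeError (an assumption, not confirmed in this thread) is that a local folder named kenlm is picked up instead of the installed package. The helper name `describe_module` is hypothetical.

```python
import importlib


def describe_module(name, attribute):
    """Report where `name` was imported from and whether it exposes `attribute`."""
    module = importlib.import_module(name)
    return {
        # A bare directory imported as a namespace package has no usable
        # __file__, which would hint at a local folder shadowing the real package.
        "file": getattr(module, "__file__", None),
        "has_attribute": hasattr(module, attribute),
    }


if __name__ == "__main__":
    # For the situation in this thread one would call:
    #   describe_module("kenlm", "Model")
    # The standard-library module below is only a stand-in so the demo runs anywhere.
    print(describe_module("json", "loads"))
```

If `file` is None or points into the working directory and `has_attribute` is False, running the script from another directory or renaming the folder may be worth trying.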
I installed it using pip install https://github.com/kpu/kenlm/archive/master.zip and I'm now getting a new error.
My guess is that my polish.arpa file is malformed somehow, but that's quite tricky to check since the file is very slow to load and edit.
Here is how the beginning of the file looks. Since the error says it's expecting a tab, I suspect there might be spaces somewhere in the file where there should be tabs.
\data\
ngram 1=86587
ngram 2=546387
ngram 3=796581
ngram 4=843999
ngram 5=850874
\1-grams:
-5.7532206 <unk> 0
0 <s> -0.06677356
0 </s> -0.06677356
Traceback (most recent call last):
  File "./eval.py", line 100, in <module>
    main(args)
  File "./eval.py", line 37, in main
    args.path_to_ngram,
  File "/home/bmoell/miniconda3/envs/wav2vec-nlp/lib/python3.7/site-packages/pyctcdecode/decoder.py", line 697, in build_ctcdecoder
    kenlm_model = None if kenlm_model_path is None else kenlm.Model(kenlm_model_path)
  File "kenlm.pyx", line 142, in kenlm.Model.__init__
OSError: Cannot read model 'polish.arpa' (lm/read_arpa.hh:51 in void lm::Read1Gram(util::FilePiece&, Voc&, Weights*, lm::PositiveProbWarn&) [with Voc = lm::ngram::ProbingVocabulary; Weights = lm::ProbBackoff] threw FormatLoadException because `f.get() != '\t''. Expected tab after probability in the 1-gram at byte 103 Byte: 103)
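Since the error reports a byte offset, one way to check the tab-versus-space suspicion without opening the whole file in an editor is to read only the bytes around that offset. This is a minimal sketch; the helper name `bytes_around` and the default radius are made up for illustration.

```python
def bytes_around(path, offset, radius=40):
    """Return the raw bytes within `radius` of `offset` without reading the whole file."""
    start = max(0, offset - radius)
    with open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(offset - start + radius)


# Example for the file in this thread (not executed here):
#   print(repr(bytes_around("polish.arpa", 103)))
# repr() shows a tab as \t and a space as a plain space, so the two are easy to tell apart.
```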
Yeah, this looks like an issue with the .arpa file. A good debugging strategy would be to:
- take less text to create the ngram -> instead of a 5-gram just do a 3-gram -> same trick with </s> -> quicker debugging cycle -> find bug -> correct -> apply same to large 5-gram.
Don't think this is related to the code here
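The "take less text" step could look roughly like the sketch below, which copies only the first lines of the training text into a smaller file. The helper name `write_head` and the file names in the comment are hypothetical.

```python
from itertools import islice


def write_head(src_path, dst_path, max_lines):
    """Copy at most `max_lines` lines from src_path to dst_path; return how many were copied."""
    copied = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after max_lines, so a large corpus is never read in full.
        for line in islice(src, max_lines):
            dst.write(line)
            copied += 1
    return copied


# Example (hypothetical file names, not executed here):
#   write_head("text.txt", "text_small.txt", 10000)
```

The smaller file can then be used to build a 3-gram instead of a 5-gram (with KenLM's lmplz that is the `-o 3` option), which keeps each debugging round short.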
@patrickvonplaten I believe the FormatLoadException occurs when the file is changed (while following the instructions) in an editor (vim, nano, even IDEs). They commonly mishandle '\t', '\n', and similar whitespace characters.
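To check whether an editor replaced tabs, the file can be streamed line by line and any n-gram entry without a tab reported. This is only a sketch: it assumes the usual ARPA layout (`\N-grams:` section headers and a closing `\end\` line), and the helper name `find_untabbed_lines` is hypothetical.

```python
import re

# Matches section headers such as \1-grams: or \5-grams:
SECTION_HEADER = re.compile(r"^\\(\d+)-grams:$")


def find_untabbed_lines(path, limit=10):
    """Return (line_number, text) for up to `limit` n-gram entries that contain no tab."""
    problems = []
    in_ngrams = False
    # newline="" keeps line endings untranslated; they are stripped by hand below.
    with open(path, "r", encoding="utf-8", newline="") as handle:
        for number, raw in enumerate(handle, start=1):
            line = raw.rstrip("\r\n")
            if SECTION_HEADER.match(line):
                in_ngrams = True
            elif line == "\\end\\":
                in_ngrams = False
            elif in_ngrams and line and "\t" not in line:
                problems.append((number, line))
                if len(problems) >= limit:
                    break
    return problems


# Example for the file in this thread (not executed here):
#   for number, text in find_untabbed_lines("polish.arpa"):
#       print(number, repr(text))
```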
The easiest way to apply your instructions without breaking the formatting would be:
- Read the arpa file with Python and find the 2 lines that need to be changed.
- Write a new arpa file that changes those lines while handling the formatting correctly, as the script below shows.
- Load it & have fun.
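# Copy lang.arpa to lang_fixed.arpa, adding the missing </s> unigram.
# The count and weight values below are specific to this example file.
with open('lang.arpa', 'r', encoding='utf-8') as original, \
        open('lang_fixed.arpa', 'w', encoding='utf-8') as fixed:
    for line in original:
        if line == 'ngram 1=634704\n':
            # One unigram is added, so the 1-gram count goes up by one.
            fixed.write('ngram 1=634705\n')
        elif line == '0\t<s>\t-0.07692495\n':
            # Keep the <s> entry and add a matching </s> entry right after it.
            fixed.write(line)
            fixed.write('0\t</s>\t-0.07692495\n')
        else:
            fixed.write(line)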
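The script above hardcodes the count and the weight of one particular file. A variant that reads both from the file itself might look like the sketch below. It assumes tab-separated unigram lines and a `ngram 1=N` count line, and it refuses to run if `</s>` is already listed. The function names are hypothetical.

```python
import re

COUNT_LINE = re.compile(r"^ngram 1=(\d+)$")


def has_unigram(path, token):
    """Scan only the \\1-grams: section and report whether `token` is listed there."""
    in_unigrams = False
    with open(path, "r", encoding="utf-8") as handle:
        for raw in handle:
            line = raw.rstrip("\r\n")
            if line == "\\1-grams:":
                in_unigrams = True
            elif in_unigrams and line.startswith("\\"):
                return False  # reached the next section without finding it
            elif in_unigrams:
                fields = line.split("\t")
                if len(fields) > 1 and fields[1] == token:
                    return True
    return False


def add_end_token(src_path, dst_path):
    """Copy an ARPA file, adding a </s> unigram that mirrors the <s> entry."""
    if has_unigram(src_path, "</s>"):
        raise ValueError("</s> is already present in the 1-grams section")
    if not has_unigram(src_path, "<s>"):
        raise ValueError("no <s> unigram found to copy from")
    inserted = False
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for raw in src:
            line = raw.rstrip("\r\n")
            count = COUNT_LINE.match(line)
            fields = line.split("\t")
            if count:
                # One unigram is added, so bump the declared 1-gram count.
                dst.write("ngram 1={}\n".format(int(count.group(1)) + 1))
            elif not inserted and len(fields) > 1 and fields[1] == "<s>":
                dst.write(raw)
                fields[1] = "</s>"
                dst.write("\t".join(fields) + "\n")
                inserted = True
            else:
                dst.write(raw)


# Example (not executed here):
#   add_end_token("lang.arpa", "lang_fixed.arpa")
```

Reading the values from the file avoids retyping them by hand, which is where tab and newline mistakes tend to creep in.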