recasepunc
Training on Russian
To train a model on Russian data from a web crawl, do you suggest a specific pre-trained BERT model?
I don't know about Russian BERTs, but what you want to care about is tokenization. In particular, the preprocessing stage needs to normalize punctuation in a tokenization-neutral manner.
There are already trained models: https://alphacephei.com/vosk/models/vosk-recasepunc-ru-0.22.zip
@nshmyrev thanks for your reply. I noticed that running the model you mentioned requires more dependencies than the ones reported in this repository. Am I correct?
I'm getting this error when trying to run prediction with this Russian model:
python3 ../../recasepunc/recasepunc.py predict checkpoint < ru-test.txt > output.txt
Traceback (most recent call last):
File "../../recasepunc/recasepunc.py", line 752, in <module>
main(config, config.action, config.action_args)
File "../../recasepunc/recasepunc.py", line 723, in main
generate_predictions(config, *args)
File "../../recasepunc/recasepunc.py", line 346, in generate_predictions
loaded = torch.load(checkpoint_path, map_location=config.device if torch.cuda.is_available() else 'cpu')
File "/raid/data/s2t/speech_tools/recasepunc/lib/python3.8/site-packages/torch/serialization.py", line 607, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/raid/data/s2t/speech_tools/recasepunc/lib/python3.8/site-packages/torch/serialization.py", line 882, in _load
result = unpickler.load()
File "/raid/data/s2t/speech_tools/recasepunc/lib/python3.8/site-packages/torch/serialization.py", line 875, in find_class
return super().find_class(mod_name, name)
AttributeError: Can't get attribute 'Trie' on <module 'transformers.tokenization_utils' from '/raid/data/s2t/speech_tools/recasepunc/lib/python3.8/site-packages/transformers/tokenization_utils.py'>
I'm using an environment with all the requirements listed here: https://github.com/benob/recasepunc. The English model, however, works without any problem.
Hi. I replied to you on https://github.com/alphacep/vosk-api/issues/1459; it needs transformers==4.25.0.
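For anyone who wants to confirm which version is installed before loading the checkpoint, a small check along these lines may help. This is only a sketch: the helper names `installed_version` and `check_pin` are made up for illustration, and the pinned version is the one quoted in the reply above.

```python
from importlib import metadata

# Version quoted in the reply above; adjust if the model's requirements change.
REQUIRED_TRANSFORMERS = "4.25.0"


def installed_version(package: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pin(package: str = "transformers",
              required: str = REQUIRED_TRANSFORMERS):
    """Return (matches, found) for an exact version pin."""
    found = installed_version(package)
    return found == required, found


if __name__ == "__main__":
    ok, found = check_pin()
    if ok:
        print(f"transformers {found} matches the pinned version")
    else:
        print(f"expected transformers=={REQUIRED_TRANSFORMERS}, found {found}")
```

If the check fails, installing the pinned version into the same virtual environment that runs recasepunc.py is the likely next step.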