recasepunc

Training on Russian

Open Lorenzoncina opened this issue 1 year ago • 5 comments

To train a model on Russian data from a web crawl, do you suggest a specific pre-trained BERT model?

Lorenzoncina avatar Nov 02 '23 09:11 Lorenzoncina

I don't know about Russian BERTs, but what you need to care about is tokenization. In particular, the preprocessing stage needs to normalize punctuation in a tokenization-neutral manner.
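As a rough illustration of that point, here is a minimal sketch of punctuation normalization applied to raw text before any tokenizer sees it. It is one possible reading of "tokenization-neutral", not recasepunc's actual preprocessing: the `PUNCT_MAP` table and the `normalize_punctuation` helper are hypothetical names, and the set of characters mapped is only an assumption about what Russian web-crawl text tends to contain.

```python
import re
import unicodedata

# Hypothetical mapping (an assumption, not taken from recasepunc):
# fold common typographic variants into a small ASCII set so that the
# same sentence yields the same punctuation whatever the source used.
PUNCT_MAP = str.maketrans({
    "«": '"', "»": '"', "“": '"', "”": '"', "„": '"',
    "‘": "'", "’": "'",
    "–": "-", "—": "-",
    "…": "...",
})


def normalize_punctuation(text: str) -> str:
    """Normalize punctuation on the raw string, independent of any tokenizer."""
    text = unicodedata.normalize("NFC", text)      # canonical Unicode composition
    text = text.translate(PUNCT_MAP)               # unify punctuation variants
    text = re.sub(r"\s+([,.!?;:])", r"\1", text)   # drop spaces before punctuation
    text = re.sub(r"\s+", " ", text)               # collapse whitespace runs
    return text.strip()


if __name__ == "__main__":
    print(normalize_punctuation("«Привет» , мир…"))
```

Because the function works on plain strings only, it can run on the crawl data before the BERT tokenizer is chosen, and the same cleaned text can be reused if the tokenizer changes later.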

benob avatar Nov 02 '23 11:11 benob

There are already trained models: https://alphacephei.com/vosk/models/vosk-recasepunc-ru-0.22.zip

nshmyrev avatar Nov 02 '23 19:11 nshmyrev

@nshmyrev thanks for your reply. I noticed that running the model you mentioned requires more dependencies than the ones listed in this repository. Am I correct?

Lorenzoncina avatar Nov 03 '23 14:11 Lorenzoncina

I'm getting this error when trying to run prediction with this Russian model:

python3 ../../recasepunc/recasepunc.py predict checkpoint < ru-test.txt > output.txt
Traceback (most recent call last):
  File "../../recasepunc/recasepunc.py", line 752, in <module>
    main(config, config.action, config.action_args)
  File "../../recasepunc/recasepunc.py", line 723, in main
    generate_predictions(config, *args)
  File "../../recasepunc/recasepunc.py", line 346, in generate_predictions
    loaded = torch.load(checkpoint_path, map_location=config.device if torch.cuda.is_available() else 'cpu')
  File "/raid/data/s2t/speech_tools/recasepunc/lib/python3.8/site-packages/torch/serialization.py", line 607, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/raid/data/s2t/speech_tools/recasepunc/lib/python3.8/site-packages/torch/serialization.py", line 882, in _load
    result = unpickler.load()
  File "/raid/data/s2t/speech_tools/recasepunc/lib/python3.8/site-packages/torch/serialization.py", line 875, in find_class
    return super().find_class(mod_name, name)
AttributeError: Can't get attribute 'Trie' on <module 'transformers.tokenization_utils' from '/raid/data/s2t/speech_tools/recasepunc/lib/python3.8/site-packages/transformers/tokenization_utils.py'>

I'm using an environment with all the requirements listed here: https://github.com/benob/recasepunc. The English model works without problems in the same environment.

Lorenzoncina avatar Nov 06 '23 07:11 Lorenzoncina

Hi. I replied to you on https://github.com/alphacep/vosk-api/issues/1459. The Russian model needs transformers==4.25.0.
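To check whether an environment matches that pin before loading the checkpoint, something like the following sketch may help. It only compares the installed `transformers` version against the `4.25.0` named above. The helper names `installed_version` and `matches_pin` are hypothetical, and exact string equality is an assumption: whether other versions also work is not stated in this thread.

```python
from importlib import metadata

REQUIRED = "4.25.0"  # version named in this thread for the Russian model


def installed_version(package: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, required: str = REQUIRED) -> bool:
    """True only when the installed version equals the pinned one exactly."""
    return installed is not None and installed == required


if __name__ == "__main__":
    found = installed_version("transformers")
    if matches_pin(found):
        print(f"transformers {found} matches the pin")
    else:
        print(f"found transformers {found}; try: pip install transformers=={REQUIRED}")
```

Running this inside the same virtual environment used for `recasepunc.py predict` shows whether the version mismatch is a plausible cause of the `Can't get attribute 'Trie'` error before reinstalling anything.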

nshmyrev avatar Nov 06 '23 15:11 nshmyrev