wtpsplit
Language wishlist
A list of languages currently considered for training and adding to the Repo:
- [x] Swedish
- [x] Norwegian
- [x] French
- [x] Turkish
- [x] Simplified Chinese
- [x] Russian
- [x] Ukrainian
- [ ] Catalan
- [ ] Dutch
- [ ] Farsi
- [ ] Italian
- [ ] Portuguese
- [ ] Spanish
- [ ] Vietnamese
- [ ] Traditional Chinese
I'll see if I can train models for languages on this list. If you want to speed it up, just train it yourself following https://github.com/bminixhofer/nnsplit/blob/master/train/train.ipynb :)
Hello! Can you add French to the list?
Thanks :)
Can you please add Turkish?
Sure!
Sure!
Thanks, will you add it today or later?
I added it to the list. I can't promise when I'll get around to training it.
drive-download-20200912T101135Z-001.zip
I have trained them with your training code; can you check and update today, please?
Or can you tell me how I can use the model with this code:
```python
import json
import wave

import vosk
from tqdm import tqdm

def recognize_speech(wav_path, lang="en", buffer_size=4000):
    # download_model and get_model_path are helpers defined elsewhere in my code
    download_model(lang)
    vosk.SetLogLevel(-1)

    wav_file = wave.open(wav_path, "rb")
    recognizer = vosk.KaldiRecognizer(
        vosk.Model("{}/{}".format(get_model_path(), lang)),
        wav_file.getframerate())

    words = []
    for index in tqdm(range(0, wav_file.getnframes(), buffer_size)):
        frames = wav_file.readframes(buffer_size)

        if recognizer.AcceptWaveform(frames):
            result = json.loads(recognizer.Result())

            if len(result["text"]) > 0:
                for token in result["result"]:
                    words.append({
                        "start": token["start"],
                        "end": token["end"],
                        "text": token["word"],
                    })

    print(words)
    return words
```
> Can you please add Turkish?

Hey, good match, I was looking for that for a very long time ...
Thanks for training the model! Your torchscript exported model was somehow broken... Which PyTorch version do you use?
I managed to recover it using the weights from the ONNX graph. This should work: torchscript_cpu_model.zip
(rename it from .zip to .pt, I'm too lazy to upload it externally right now)
Load it like this:
```python
import torch
import nnsplit

splitter = nnsplit.NNSplit(torch.jit.load("torchscript_cpu_model.pt"), "cpu")

for sentence in splitter.split(["Bu bir cümle Bu ikinci bir cümle."])[0]:
    print(str(sentence))
```
which prints:
Bu bir cümle
Bu ikinci bir cümle.
(sorry if this is broken Turkish)
I'll properly add it to the repository later!
Also, I'm happy to see some more interest in the library now, I'll move forward with some changes I had thought about (configurable speed / accuracy tradeoff at inference, more robust training).
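By the way, if you want to plug this into ASR output like your vosk snippet above, something along these lines should work (a sketch assuming the `words` list and the `splitter` from the code above):

```python
# join the recognized tokens back into one text, then split it into sentences
text = " ".join(token["text"] for token in words)

for sentence in splitter.split([text])[0]:
    print(str(sentence))
```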
Thanks a ton. Great work.
Could you please also add Simplified Chinese? Thanks a lot.
Very interested in a French model as well!
I've been a bit busy lately. I'm now working on training and evaluating all models currently in the list (Norwegian, French, Swedish, Turkish, Simplified Chinese). It will be done tomorrow.
I also improved the model a bit: it's now faster and more accurate thanks to a downsampling trick (downsample -> LSTM -> upsample), so I'm retraining English and German as well.
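Roughly, the trick looks like this (a minimal sketch with made-up sizes, not the actual model code):

```python
import torch
from torch import nn

class DownsampledLSTM(nn.Module):
    def __init__(self, dim=64, factor=2):
        super().__init__()
        self.down = nn.Conv1d(dim, dim, kernel_size=factor, stride=factor)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * dim, dim)
        self.up = nn.Upsample(scale_factor=factor)

    def forward(self, x):
        # x: (batch, seq_len, dim); seq_len must be divisible by factor
        h = self.down(x.transpose(1, 2)).transpose(1, 2)   # (batch, seq_len / factor, dim)
        h, _ = self.lstm(h)                                # the LSTM runs on the shorter sequence
        h = self.proj(h)
        return self.up(h.transpose(1, 2)).transpose(1, 2)  # back to (batch, seq_len, dim)

out = DownsampledLSTM()(torch.randn(1, 128, 64))  # -> (1, 128, 64)
```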
I trained all the models and released them as Release 0.5.0.
You can now do:
```python
import nnsplit
print(nnsplit.__version__) # should be 0.5.0-post0

# english
nnsplit.NNSplit.load("en")
# german
nnsplit.NNSplit.load("de")
# turkish
nnsplit.NNSplit.load("tr")
# french
nnsplit.NNSplit.load("fr")
# norwegian
nnsplit.NNSplit.load("no")
# swedish
nnsplit.NNSplit.load("sv")
# chinese
nnsplit.NNSplit.load("zh")
```
Training went well, metrics are in the README.
I'll have to retrain the Chinese model though: Chinese punctuation (e.g. 。) is not in `string.punctuation`, so it wasn't getting removed.
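For reference, `string.punctuation` only covers ASCII punctuation:

```python
import string

print(string.punctuation)          # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
print("。" in string.punctuation)   # False, so Chinese punctuation was never stripped
print("." in string.punctuation)   # True
```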
Also, as I mentioned, I made some improvements to the model architecture so it's quite a bit more accurate now.
There's also now #20 as a tracking issue for problems with these models.
@bminixhofer Awesome! This is super helpful, thank you for putting effort into helping random people on the internet! :D
You're welcome!
@aguang-xyz As of 0.5.2 I retrained the Chinese model with fixed punctuation removal. It should now work properly for text without punctuation. Metrics are still not very good, but consistently better than SpaCy.
Hi, could you please add Russian?
And Ukrainian, if it's not too hard. Anyway, I will try to build a model myself using the information from the notebook.
Hi, sure!
I added them to the list. I'll give training them a go as well starting with Russian.
@egorsmkv I noticed there is some code hardcoded to use a compound splitter at the moment; it needs some small changes in `model.py` to remove that. I'll fix it so you can train a model.
The `train.ipynb` notebook is up to date now and the compound splitter issue is fixed, so training a model should work now.
I trained a model for Russian already and it looks good. I'll train another one for a bit longer, and a Ukrainian model overnight.
Russian and Ukrainian are now trained & integrated in the Repo. It would be great if you could do a quick sanity check @egorsmkv @marlon-br, i.e. check that they split error-free text correctly, don't split on abbreviations, and correctly split text with some missing punctuation and casing, using the demo:
https://bminixhofer.github.io/nnsplit/#demo
since I speak neither of these languages. Metrics also look good:
https://bminixhofer.github.io/nnsplit/#metrics
@bminixhofer I tried the Russian sample text. After I removed one comma (between воплощение and построенная), it started to split the sentence into two sentences. In this case the second sentence doesn't make sense, because it is totally dependent on the first part of the original sentence.
Thanks for checking! There seems to have been a problem with `,` also being removed as punctuation during training. This might also impact some other languages; I'm retraining the affected models.
@bminixhofer I think it would be a great synergy if you added all the languages supported by vosk: https://alphacephei.com/vosk/models A lot of people would be very interested in getting sentence boundaries etc. for texts from ASR.
Just tested with Ukrainian sentences, looks really good! Thank you, Ben!
Great. I'm retraining the models for 10 epochs to match the other models and fixed the punctuation issue; the release will be out by Monday at the latest.
@bminixhofer did you change something in the code or training script to fix the punctuation? I started training a Russian model on my own and would like to understand if I should fix something too.
Yes, I just pushed the commit. The nnsplit training procedure is just:
- split sentences using SpaCy from a corpus which is assumed to not contain any punctuation errors
- with some probability, remove punctuation at the end of the sentences
- train a sequence labelling model where, for each byte, the label is 1 if it is the last byte of the current sentence and 0 otherwise (sketched below)
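A simplified sketch of the labelling (a hypothetical helper, not the actual training code):

```python
import random

def make_example(sentences, remove_prob=0.5, punctuation=".?!"):
    # Concatenate sentences into one text; label each byte 1 if it is the
    # last byte of a sentence, 0 otherwise. Joining whitespace between
    # sentences is omitted here for brevity.
    text = bytearray()
    labels = []

    for sentence in sentences:
        # with some probability, remove punctuation at the end of the sentence
        if sentence and sentence[-1] in punctuation and random.random() < remove_prob:
            sentence = sentence[:-1]

        encoded = sentence.encode("utf-8")
        text += encoded
        labels += [0] * (len(encoded) - 1) + [1]

    return bytes(text), labels

text, labels = make_example(["Это первое предложение.", "Это второе."])
```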
The problem was that in the second step ("with some probability, remove punctuation at the end of the sentences"), all chars in `string.punctuation` were considered punctuation. Now, this is configurable with an argument to the `SpacySentenceTokenizer`. I use `.?!` for Russian.
The underlying problem is that SpaCy makes some mistakes, e.g. splits after a comma in some cases. This is not solved by the update, but not removing commas should be an improvement.
@bminixhofer I see, thanks!
How much time does it generally take to train for 10 epochs?
The bottleneck is often the SpaCy sentencizer. On my machine with an RTX 2080TI it takes ~ 2 hours for Ukrainian and ~ 10 hours for Russian.
Also, I should've said: train with 10M samples. One epoch is set to use 500k samples in `train.ipynb` and 1M by default in the Python scripts. SpaCy leaks memory when running in parallel across multiple cores, and this is reset after each epoch, so you have to set the samples per epoch to something you don't run out of memory with :)
@bminixhofer I ran Russian language model training on Google Colab Pro with a V100 and it takes 2 hours for one epoch, so I expect it will take about 20 hours for 10 epochs. That is 2 times more than in your setup, but a V100 is faster than a 2080, so I wonder why it takes longer.
It could very well be CPU bound. I use an i5 8600k.
> The bottleneck is often the SpaCy sentencizer.
Updated Ukrainian and Russian models are now in the Repo. Both are significantly better now, but in Russian there is still the same issue with the comma in the example text; I don't think there is anything I can do about that.
@marlon-br If you're training models you might be interested in https://wandb.ai/bminixhofer/nnsplit where I track the experiments, e.g. https://wandb.ai/bminixhofer/nnsplit/runs/3poigs9a is the latest Russian run.
@bminixhofer AFAIK commas in Russian (and similar languages) are tricky even for native speakers. They carry much more importance and meaning than in other languages :)
It would be great if you could add the following languages: Catalan, Dutch, Farsi, Italian, Portuguese, Spanish and Vietnamese. Thanks in advance!
Sure, I added them to the list. I am currently focusing on nlprule so this may take some time, I appreciate PRs :) Ideally models should be trained on 10M samples but less is ok too.
@bminixhofer nlprule looks very interesting because, you know, I use sentence splitting after ASR, and since ASR is not perfect and spoken language differs from Wikipedia language, sentence boundary detection is also not perfect. For example, the text after ASR looks like this: "hey guys i'm gabby wallace and this is a go natural english lesson i got a great question from a viewer about pronunciation you know one of the most difficult sounds an english but also one of the most common sounds is that are sound and i love teaching the sound because it kind of sounds funny i was think it sounds like a pirate right or can you imagine me with a little pirate high in a hook yeah or maybe well that's exactly what it is it's a pirate sound that's what i call it anyway so we're we're to work on our pirates sounds today one particular word the question that my view or us was how do you say gee i r l girl girl is a really common word right woman girl girls a young woman okay so this is a very common word we need to know how to say it especially if you are a girl you need to be able to say i'm a girl or hey girls only girls club i don't know when i was a teenager or not a teenager maybe more a kid we used to have girls only clubs okay anyway i'm getting off the point year let's talk about pronunciation"
I think nlprule could improve the results a bit :)
> Could you please also add Simplified Chinese? Thanks a lot.

Will you also train one for Traditional Chinese?
Hi, feel free to keep requests coming here (so I know what to prioritize when I circle back to this library). However, I am currently not training any new models. You can train models yourself here: https://github.com/bminixhofer/nnsplit/blob/main/train/train.ipynb.