
Language wishlist

Open bminixhofer opened this issue 3 years ago • 40 comments

A list of languages currently considered for training and adding to the repo:

  • [x] Swedish
  • [x] Norwegian
  • [x] French
  • [x] Turkish
  • [x] Simplified Chinese
  • [x] Russian
  • [x] Ukrainian
  • [ ] Catalan
  • [ ] Dutch
  • [ ] Farsi
  • [ ] Italian
  • [ ] Portuguese
  • [ ] Spanish
  • [ ] Vietnamese
  • [ ] Traditional Chinese

I'll see if I can train models for languages on this list. If you want to speed it up, just train it yourself following https://github.com/bminixhofer/nnsplit/blob/master/train/train.ipynb :)

bminixhofer avatar Sep 06 '20 20:09 bminixhofer

Hello! Can you add French to the list?

Thanks :)

adrien-jacquot avatar Sep 07 '20 07:09 adrien-jacquot

Can you please add Turkish?

SutirthaChakraborty avatar Sep 12 '20 01:09 SutirthaChakraborty

Sure!

bminixhofer avatar Sep 12 '20 07:09 bminixhofer

Sure!

Thanks, will you add it today or later?

SutirthaChakraborty avatar Sep 12 '20 09:09 SutirthaChakraborty

I added it to the list. I can't promise when I'll get around to training it.

bminixhofer avatar Sep 12 '20 10:09 bminixhofer

drive-download-20200912T101135Z-001.zip I have trained a model with your training code; can you check it and update the repo today, please?

Or can you tell me how I can use the model in this code:

import json
import wave

import vosk
from tqdm import tqdm

def recognize_speech(wav_path, lang="en", buffer_size=4000):
    # download_model and get_model_path are helpers from the surrounding project
    download_model(lang)

    vosk.SetLogLevel(-1)
    wav_file = wave.open(wav_path, "rb")
    recognizer = vosk.KaldiRecognizer(
        vosk.Model("{}/{}".format(get_model_path(), lang)),
        wav_file.getframerate())

    words = []
    for index in tqdm(range(0, wav_file.getnframes(), buffer_size)):
        frames = wav_file.readframes(buffer_size)
        if recognizer.AcceptWaveform(frames):
            result = json.loads(recognizer.Result())
            if len(result["text"]) > 0:
                # collect word-level timestamps from the recognizer output
                for token in result["result"]:
                    words.append({
                        "start": token["start"],
                        "end": token["end"],
                        "text": token["word"],
                    })

    print(words)
    return words

SutirthaChakraborty avatar Sep 12 '20 10:09 SutirthaChakraborty

Can you please add Turkish?

Hey, good match, I've been looking for this for a very long time...

senemaktas avatar Sep 12 '20 12:09 senemaktas

Thanks for training the model! Your TorchScript-exported model was somehow broken... Which PyTorch version do you use?

I managed to recover it using the weights from the ONNX graph. This should work: torchscript_cpu_model.zip

(rename it from .zip to .pt, I'm too lazy to upload it externally right now)

Load it like this:

import torch
import nnsplit

splitter = nnsplit.NNSplit(torch.jit.load("torchscript_cpu_model.pt"), "cpu")
for sentence in splitter.split(["Bu bir cümle Bu ikinci bir cümle."])[0]:
    print(str(sentence))

which prints:

Bu bir cümle
Bu ikinci bir cümle.

(sorry if this is broken Turkish)

I'll properly add it to the repository later!

Also, I'm happy to see some more interest in the library now; I'll move forward with some changes I had been thinking about (a configurable speed/accuracy tradeoff at inference, more robust training).

bminixhofer avatar Sep 12 '20 13:09 bminixhofer

Thanks a ton. Great work.

SutirthaChakraborty avatar Sep 12 '20 13:09 SutirthaChakraborty

Could you please also add Simplified Chinese? Thanks a lot.

aguang-xyz avatar Sep 13 '20 14:09 aguang-xyz

Very interested in a French model as well!

dmenig avatar Oct 16 '20 20:10 dmenig

I've been a bit busy lately. I'm now working on training and evaluating all models currently on the list (Norwegian, French, Swedish, Turkish, Simplified Chinese). It will be done tomorrow.

I also improved the model a bit; it's now faster and more accurate thanks to a downsampling trick (downsample -> LSTM -> upsample), so I'm retraining English and German as well.
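
Roughly, my reading of that trick is a pooling scheme like the following (a minimal sketch with made-up names, not the actual nnsplit architecture): merge neighbouring character embeddings before the LSTM and repeat the outputs back afterwards, so the LSTM runs over a shorter sequence.

import torch
import torch.nn as nn

class DownsampledLSTM(nn.Module):
    def __init__(self, n_chars=256, dim=64, factor=2):
        super().__init__()
        self.factor = factor
        self.embed = nn.Embedding(n_chars, dim)
        # the LSTM sees `factor` concatenated positions at once
        self.lstm = nn.LSTM(dim * factor, dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * dim, 1)  # per-byte "end of sentence" logit

    def forward(self, x):  # x: (batch, seq_len), seq_len divisible by factor
        h = self.embed(x)
        b, s, d = h.shape
        # downsample: merge `factor` neighbouring positions into one
        h = h.reshape(b, s // self.factor, d * self.factor)
        h, _ = self.lstm(h)
        # upsample: repeat each output back to the original resolution
        h = h.repeat_interleave(self.factor, dim=1)
        return self.out(h)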

bminixhofer avatar Oct 17 '20 11:10 bminixhofer

I trained all the models and released them as Release 0.5.0.

You can now do:

import nnsplit

print(nnsplit.__version__) # should be 0.5.0-post0

# english
nnsplit.NNSplit.load("en")
# german
nnsplit.NNSplit.load("de")
# turkish
nnsplit.NNSplit.load("tr")
# french
nnsplit.NNSplit.load("fr")
# norwegian
nnsplit.NNSplit.load("no")
# swedish
nnsplit.NNSplit.load("sv")
# chinese
nnsplit.NNSplit.load("zh")

Training went well; metrics are in the README. I'll have to retrain the Chinese model though: Chinese punctuation (e.g. 。) is not in string.punctuation, so it wasn't getting removed.
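
A quick check shows why, since string.punctuation only contains ASCII characters:

import string

print("." in string.punctuation)   # True
print("。" in string.punctuation)  # False: string.punctuation is ASCII-only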

Also, as I mentioned, I made some improvements to the model architecture so it's quite a bit more accurate now.

There's also now #20 as a tracking issue for problems with these models.

bminixhofer avatar Oct 19 '20 07:10 bminixhofer

@bminixhofer Awesome! This is super helpful, thank you for putting effort into helping random people on the internet! :D

EmilStenstrom avatar Oct 19 '20 08:10 EmilStenstrom

You're welcome!

bminixhofer avatar Oct 21 '20 07:10 bminixhofer

@aguang-xyz As of 0.5.2 I retrained the Chinese model with fixed punctuation removal. It should now work properly for text without punctuation. Metrics are still not very good, but consistently better than spaCy.
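
For example (my own example text, using the loading API from the 0.5.0 release notes above):

import nnsplit

splitter = nnsplit.NNSplit.load("zh")
# no punctuation between the two sentences
for sentence in splitter.split(["这是第一句话这是第二句话。"])[0]:
    print(str(sentence))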

bminixhofer avatar Nov 01 '20 11:11 bminixhofer

Hi, could you please add Russian?

marlon-br avatar Feb 03 '21 11:02 marlon-br

And Ukrainian, if it's not too hard. Anyway, I will try to build a model myself using the information from the notebook.

egorsmkv avatar Feb 03 '21 11:02 egorsmkv

Hi, sure!

I added them to the list. I'll give training them a go as well, starting with Russian.

bminixhofer avatar Feb 03 '21 11:02 bminixhofer

@egorsmkv I noticed there is some code hardcoded to use a compound splitter at the moment; it needs some small changes in model.py to remove that. I'll fix it so you can train a model.

bminixhofer avatar Feb 03 '21 12:02 bminixhofer

The train.ipynb notebook is up to date now and the compound splitter issue is fixed, so training a model should work now.

I trained a model for Russian already and it looks good. I'll train another one for a bit longer, and a Ukrainian model overnight.

bminixhofer avatar Feb 03 '21 20:02 bminixhofer

Russian and Ukrainian are now trained & integrated in the repo. It would be great if you could do a quick sanity check @egorsmkv @marlon-br, i.e. check in the demo that they split error-free text correctly, don't split on abbreviations, and correctly split text with some missing punctuation and casing:

https://bminixhofer.github.io/nnsplit/#demo

since I speak neither of these languages. Metrics also look good:

https://bminixhofer.github.io/nnsplit/#metrics

bminixhofer avatar Feb 04 '21 14:02 bminixhofer

@bminixhofer I tried the Russian sample text. After I removed one comma (between воплощение and построенная), it started to split the sentence into two sentences. In this case the second sentence doesn't make sense because it is totally dependent on the first part of the original sentence.

marlon-br avatar Feb 04 '21 14:02 marlon-br

Thanks for checking! There seems to have been a problem with the comma also being removed as punctuation during training. This might also impact some other languages; I'm retraining the affected models.

bminixhofer avatar Feb 04 '21 18:02 bminixhofer

@bminixhofer I think it would be a great synergy if you added all languages that are supported by Vosk: https://alphacephei.com/vosk/models. A lot of people would be very interested in getting sentence boundaries etc. for texts coming out of ASR.

marlon-br avatar Feb 04 '21 21:02 marlon-br

Just tested with Ukrainian sentences, looks really good! Thank you, Ben!

egorsmkv avatar Feb 05 '21 09:02 egorsmkv

Great. I'm retraining the models for 10 epochs to match the other models, with the punctuation issue fixed; the release will be out on Monday at the latest.

bminixhofer avatar Feb 05 '21 09:02 bminixhofer

@bminixhofer did you change something in the code or training script to fix the punctuation? I started to train a Russian model on my own and would like to understand if I should fix something too.

marlon-br avatar Feb 05 '21 10:02 marlon-br

Yes, I just pushed the commit. The nnsplit training procedure is just:

  • split sentences using SpaCy from a corpus which is assumed to not contain any punctuation errors
  • with some probability, remove punctuation at the end of the sentences
  • train a sequence labelling model, where for each byte the label is 1 if it is the last byte in the current sentence and 0 otherwise (sketched below)

The problem was that in the step

"with some probability, remove punctuation at the end of the sentences"

all chars in string.punctuation were considered punctuation. Now this is configurable with an argument to the SpacySentenceTokenizer; I use .?! for Russian.

The underlying problem is that SpaCy makes some mistakes, e.g. it splits after a comma in some cases. This is not solved by the update, but not removing commas should be an improvement.
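
A minimal sketch of the labelling step in Python (a hypothetical helper for illustration, not the actual nnsplit training code):

import random

def make_example(sentences, remove_prob=0.5, punctuation=".?!"):
    text, labels = b"", []
    for sentence in sentences:
        # with some probability, strip punctuation at the end of the sentence
        if random.random() < remove_prob:
            sentence = sentence.rstrip(punctuation)
        encoded = sentence.encode("utf-8")
        # per-byte labels: 1 on the last byte of the sentence, 0 otherwise
        labels += [0] * (len(encoded) - 1) + [1]
        text += encoded + b" "
        labels.append(0)  # the separating space gets label 0
    return text, labels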

bminixhofer avatar Feb 05 '21 10:02 bminixhofer

@bminixhofer I see, thanks!

How much time does it generally take to train for 10 epochs?

marlon-br avatar Feb 05 '21 10:02 marlon-br

The bottleneck is often the SpaCy sentencizer. On my machine with an RTX 2080 Ti it takes ~2 hours for Ukrainian and ~10 hours for Russian.

Also, I should have said: train with 10M samples. One epoch in train.ipynb is set to use 500k samples, and 1M by default in the Python scripts. SpaCy leaks memory when running in parallel across multiple cores; this is reset after each epoch, so you have to set the samples per epoch to something you don't run out of memory with :)

bminixhofer avatar Feb 05 '21 10:02 bminixhofer

@bminixhofer I ran Russian language model training on Google Colab Pro with a V100 and it takes 2 hours for one epoch, so I expect it will take about 20 hours for 10 epochs. That is twice as long as in your setup, even though a V100 is faster than a 2080, so I wonder why it is slower.

marlon-br avatar Feb 05 '21 20:02 marlon-br

It could very well be CPU-bound. I use an i5-8600K.

The bottleneck is often the SpaCy sentencizer.

bminixhofer avatar Feb 05 '21 21:02 bminixhofer

Updated Ukrainian and Russian models are now in the repo. Both are significantly better now, but in Russian there is still the same issue with the comma in the example text; I don't think there is anything I can do about that.

@marlon-br If you're training models you might be interested in https://wandb.ai/bminixhofer/nnsplit where I track the experiments, e.g. https://wandb.ai/bminixhofer/nnsplit/runs/3poigs9a is the latest Russian run.

bminixhofer avatar Feb 06 '21 10:02 bminixhofer

@bminixhofer AFAIK commas in Russian (and similar languages) are tricky even for native speakers. They carry much more importance and sense than in other languages :)

marlon-br avatar Feb 06 '21 11:02 marlon-br

It would be great if you could add the following languages: Catalan, Dutch, Farsi, Italian, Portuguese, Spanish and Vietnamese. Thanks in advance!

marlon-br avatar Feb 09 '21 13:02 marlon-br

Sure, I added them to the list. I am currently focusing on nlprule, so this may take some time; I appreciate PRs :) Ideally models should be trained on 10M samples, but less is OK too.

bminixhofer avatar Feb 09 '21 18:02 bminixhofer

@bminixhofer nlprule looks very interesting because, you know, I use sentence splitting after ASR, and since ASR is not perfect, and spoken language differs from Wikipedia language, sentence boundary detection is not perfect either. For example, the text after ASR looks like this: "hey guys i'm gabby wallace and this is a go natural english lesson i got a great question from a viewer about pronunciation you know one of the most difficult sounds an english but also one of the most common sounds is that are sound and i love teaching the sound because it kind of sounds funny i was think it sounds like a pirate right or can you imagine me with a little pirate high in a hook yeah or maybe well that's exactly what it is it's a pirate sound that's what i call it anyway so we're we're to work on our pirates sounds today one particular word the question that my view or us was how do you say gee i r l girl girl is a really common word right woman girl girls a young woman okay so this is a very common word we need to know how to say it especially if you are a girl you need to be able to say i'm a girl or hey girls only girls club i don't know when i was a teenager or not a teenager maybe more a kid we used to have girls only clubs okay anyway i'm getting off the point year let's talk about pronunciation"

I think nlprule could improve the results a bit :)
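
For reference, running a transcript like that through nnsplit (using the loading API from the 0.5.0 release notes above; the text here is just an excerpt) would look roughly like this:

import nnsplit

splitter = nnsplit.NNSplit.load("en")
asr_text = "hey guys i'm gabby wallace and this is a go natural english lesson ..."
for sentence in splitter.split([asr_text])[0]:
    print(str(sentence))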

marlon-br avatar Feb 10 '21 09:02 marlon-br

Could you please also add Simplified Chinese? Thanks a lot.

Will you also add Traditional Chinese?

conanchen avatar Nov 30 '21 01:11 conanchen

Hi, feel free to keep requests coming here (so I know what to prioritize when I circle back to this library). However, I am currently not training any new models. You can train models yourself here: https://github.com/bminixhofer/nnsplit/blob/main/train/train.ipynb.

bminixhofer avatar Jan 18 '22 13:01 bminixhofer