wtpsplit Incorrect splits

Open bminixhofer opened this issue 3 years ago • 5 comments

Please report issues here that are similar to https://github.com/bminixhofer/nnsplit/issues/18, i.e. text where it is easy for humans to see the correct split but NNSplit gets it wrong.

I'm not entirely satisfied with the quality of the models yet; such cases might help improve them.

Oct 19 '20 07:10 bminixhofer

Hi, could you please take a look at the following split: "let me guess you're the kind of guy that ignores the rules cause it makes you feel in control am i right you're not wrong you think that's cute do you think it's cute"

The text is from a random TikTok video: https://www.tiktok.com/foryou?is_copy_url=1&is_from_webapp=v2#/@lisaandlena/video/6922836710988500229

The result should be something like: "Let me guess, you're the kind of guy that ignores the rules cause it makes you feel in control. Am i right. You're not wrong. You think that's cute. Do you think it's cute"

But we get: "Let me guess you're the kind of guy that ignores the rules cause it makes you feel in control am. I right you're not wrong. You think that's cute do you think it's cute" Why "am" and "I" are split is the most questionable part :)

BTW, have you thought about adding texts from other sources to the model training? Wikipedia is mostly written or academic language, not everyday spoken language.
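To make a report like this easy to compare, the expected and actual splits can be turned into word-index boundaries and diffed. The following is only a minimal sketch: `boundaries` and `compare_splits` are hypothetical helper names, not part of NNSplit, and it assumes both splits contain the same words in the same order (punctuation attached to a word does not change the count).

```python
def boundaries(sentences):
    """Return the word counts at which a sentence ends (the final end is ignored)."""
    ends = set()
    count = 0
    for sentence in sentences[:-1]:
        count += len(sentence.split())  # whitespace tokens; "guess," counts as one word
        ends.add(count)
    return ends


def compare_splits(expected, predicted):
    """Return (missed, spurious) boundaries as sorted lists of word indices."""
    exp = boundaries(expected)
    pred = boundaries(predicted)
    return sorted(exp - pred), sorted(pred - exp)


expected = [
    "Let me guess, you're the kind of guy that ignores the rules "
    "cause it makes you feel in control.",
    "Am i right.",
    "You're not wrong.",
    "You think that's cute.",
    "Do you think it's cute",
]
predicted = [
    "Let me guess you're the kind of guy that ignores the rules "
    "cause it makes you feel in control am.",
    "I right you're not wrong.",
    "You think that's cute do you think it's cute",
]

missed, spurious = compare_splits(expected, predicted)
print("missed boundaries after word:", missed)
print("spurious boundaries after word:", spurious)
```

A boundary that appears in `spurious` right next to one in `missed` (as with "am" / "I" here) points to a split that is off by one word rather than missing entirely.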

Feb 25 '21 12:02 marlon-br

Hi, thanks for reporting this!

As you said, the issue here is likely that the model is just trained on written (mostly academic) language, not on spoken words. For example "am I right" probably barely ever occurs in Wikipedia at the start of a sentence so it makes sense it's not recognized.

> texts from other sources

Do you have any specific sources in mind? I once considered using text from OPUS OpenSubtitles, but the issue there is that the samples are often not really one sentence (from manual inspection). Texts from OPUS could probably be made to work with some preprocessing, though.
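As a rough illustration of what such preprocessing could look like, here is a minimal sketch of a conservative filter that keeps only samples that look like exactly one sentence. The function name `looks_like_one_sentence` and the rules themselves are assumptions made for illustration; this is not what NNSplit does, and a real pipeline would need language-specific handling.

```python
import re

# Sentence-final punctuation followed by whitespace and more text
# suggests that the sample contains more than one sentence.
_INTERNAL_BREAK = re.compile(r"[.!?]+\s+\S")


def looks_like_one_sentence(sample: str) -> bool:
    """Heuristically decide whether a subtitle sample is a single sentence."""
    text = sample.strip()
    if not text:
        return False
    # No terminal punctuation: probably a fragment continued on the next line.
    if text[-1] not in ".!?":
        return False
    # Lowercase or dash start: probably a continuation or a dialogue marker.
    if not text[0].isupper():
        return False
    # Deliberately strict: abbreviations such as "Mr. Smith" are also rejected.
    return _INTERNAL_BREAK.search(text) is None


samples = [
    "Am I right?",
    "You're not wrong. You think that's cute?",
    "and then he left",
    "- You think?",
]
kept = [s for s in samples if looks_like_one_sentence(s)]
print(kept)
```

The filter trades recall for precision: it throws away some valid sentences so that the samples it keeps are more likely to be clean single-sentence training examples.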

I'm open to the idea of retraining the models on more diverse sources.

Feb 25 '21 12:02 bminixhofer

I think the best sources are social network comments, YouTube comments, and WhatsApp/Telegram etc. chats, for example: https://www.kaggle.com/dolfik/russian-telegram-chats-history or https://lionbridge.ai/datasets/15-best-chatbot-datasets-for-machine-learning/ etc.

But preprocessing is required either way.

Feb 25 '21 13:02 marlon-br

I recently discovered https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/punctuation_and_capitalization.html from NVIDIA.

They use the following sources:

The model was trained with Huggingface DistilBERT base uncased checkpoint on a subset of data from the following sources:

Tatoeba sentences
Books from Project Gutenberg that were used as part of the LibriSpeech corpus
Transcripts from Fisher English Training Speech

The output is exactly as expected:

"Let me guess, you're the kind of guy that ignores the rules cause it makes you feel in control. Am i right? You're not wrong? You think that's cute? Do you think it's cute?"

May 20 '21 16:05 marlon-br

Hi, thanks, this looks interesting as a starting point to distill further using the nnsplit models (since DistilBERT is still probably too slow).

May 27 '21 09:05 bminixhofer
