
Support for sentence splitting

Open xhluca opened this issue 4 years ago • 3 comments

Right now TranslationModel.translate will translate each input string as-is, which can be extremely slow for longer sequences due to the quadratic runtime of the architecture. The currently recommended workaround is to split sentences with nltk:

import nltk
import dl_translate as dlt

nltk.download("punkt")  # download the Punkt sentence tokenizer models

model = dlt.TranslationModel()

text = "Mr. Smith went to his favorite cafe. There, he met his friend Dr. Doe."
sents = nltk.tokenize.sent_tokenize(text, "english")  # use "english", not dlt.lang.ENGLISH
" ".join(model.translate(sents, source=dlt.lang.ENGLISH, target=dlt.lang.FRENCH))

This works well but doesn't cover all of the supported languages. It would be interesting to train the punkt model on each of the languages made available (though we'd need a very large dataset for that). Once that's done, the snippet above could be replaced by a simple argument, e.g. model.translate(..., max_length="sentence"). With some more effort, the max_length parameter could also accept an integer n between 0 and 512 representing the maximum number of tokens. Moreover, rather than truncating at that length, we could break the input text down into sequences of length n or less by aggregating whole sentences.
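The sentence-aggregation idea above could be sketched roughly as follows: greedily pack consecutive sentences (e.g. the output of nltk.tokenize.sent_tokenize) into chunks of at most n tokens. The chunk_sentences helper is hypothetical, and the whitespace-split token count is a stand-in for the model's real subword tokenizer:

```python
def chunk_sentences(sents, n):
    """Greedily pack consecutive sentences into chunks of at most n tokens.

    `sents` is a list of sentence strings (e.g. from
    nltk.tokenize.sent_tokenize). Token counts use a plain whitespace
    split here; a real implementation would count subword tokens with
    the model's tokenizer. A single sentence longer than n still
    becomes its own chunk rather than being truncated.
    """
    chunks, current, count = [], [], 0
    for sent in sents:
        k = len(sent.split())
        if current and count + k > n:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += k
    if current:
        chunks.append(" ".join(current))
    return chunks


sents = [
    "Mr. Smith went to his favorite cafe.",  # 7 whitespace tokens
    "There, he met his friend Dr. Doe.",     # 7 whitespace tokens
]
print(chunk_sentences(sents, 10))  # two chunks: 7 + 7 > 10
print(chunk_sentences(sents, 20))  # one chunk: 14 <= 20
```

Each chunk could then be passed to model.translate and the results joined back together, as in the snippet above.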

xhluca avatar Feb 26 '21 23:02 xhluca

stanza might be a good option

xhluca avatar Mar 04 '21 22:03 xhluca

Might be worth training punkt on CC-100 or mC4 (the dataset behind mT5)

xhluca avatar Mar 11 '21 17:03 xhluca

What do you think about https://pypi.org/project/sentence-splitter/ ?

fbaeumer avatar Apr 22 '21 07:04 fbaeumer