
Support for sentence splitting

Open xhluca opened this issue 4 years ago • 3 comments

Right now TranslationModel.translate will translate each input string as-is, which can be extremely slow for longer sequences due to the quadratic runtime of the architecture. The currently recommended workaround is to split sentences with nltk:

import nltk
import dl_translate as dlt

nltk.download("punkt")  # download the Punkt sentence tokenizer models

model = dlt.TranslationModel()

text = "Mr. Smith went to his favorite cafe. There, he met his friend Dr. Doe."
sents = nltk.tokenize.sent_tokenize(text, "english")  # use "english", not dlt.lang.ENGLISH
" ".join(model.translate(sents, source=dlt.lang.ENGLISH, target=dlt.lang.FRENCH))

This works well but doesn't cover all of the supported languages. It would be interesting to train the punkt model on each of the languages made available (though we'd need a very large dataset for that). Once that's done, the snippet above could be replaced by a simple argument, e.g. model.translate(..., max_length="sentence"). With some more effort, the max_length parameter could also accept an integer n between 0 and 512 representing the maximum number of tokens. Moreover, rather than truncating at that length, we could break the input text down into sequences of length n or less by aggregating whole sentences.
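The sentence-aggregation idea above could be sketched roughly as follows: greedily pack consecutive sentences (e.g. the output of nltk.tokenize.sent_tokenize) into chunks of at most n tokens. The chunk_sentences helper is hypothetical, and the whitespace-split token count is a stand-in for the model's real subword tokenizer:

```python
def chunk_sentences(sents, n):
    """Greedily pack consecutive sentences into chunks of at most n tokens.

    `sents` is a list of sentence strings (e.g. from
    nltk.tokenize.sent_tokenize). Token counts use a plain whitespace
    split here; a real implementation would count subword tokens with
    the model's tokenizer. A single sentence longer than n still
    becomes its own chunk rather than being truncated.
    """
    chunks, current, count = [], [], 0
    for sent in sents:
        k = len(sent.split())
        if current and count + k > n:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += k
    if current:
        chunks.append(" ".join(current))
    return chunks


sents = [
    "Mr. Smith went to his favorite cafe.",  # 7 whitespace tokens
    "There, he met his friend Dr. Doe.",     # 7 whitespace tokens
]
print(chunk_sentences(sents, 10))  # two chunks: 7 + 7 > 10
print(chunk_sentences(sents, 20))  # one chunk: 14 <= 20
```

Each chunk could then be passed to model.translate and the results joined back together, as in the snippet above.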

xhluca avatar Feb 26 '21 23:02 xhluca

stanza might be a good option

xhluca avatar Mar 04 '21 22:03 xhluca

Might be worth training punkt on CC-100 or mC4 (the dataset behind mT5)

xhluca avatar Mar 11 '21 17:03 xhluca

What do you think about https://pypi.org/project/sentence-splitter/ ?

fbaeumer avatar Apr 22 '21 07:04 fbaeumer