
entire sentences are dropped in T2TT

Open MathiasSchindler opened this issue 1 year ago • 2 comments

I tried to use T2TT to translate an entire paragraph (output from whisper.cpp of a debate in a state parliament) and it appears that in deu -> eng translation, the last two sentences were dropped entirely. Are there any debug flags or logfiles that could help narrow down how these sentences became missing in action?

mathias@mathias-b650:~/Landtag$ m4t_predict "Das Polizeigesetz wurde im Jahr 2019 von der damaligen Rot-Roten Koalition umfassend reformiert. Insbesondere unter dem Eindruck der Gefahren des Terrorismus wurden teils Eingriffsbefugnisse sehr weit ins Vorfeld einer tatsächlichen Straftat oder auch nur der konkreten Planung verlegt. Einige der geplanten Verschärfungen, von denen dann nicht alle im tatsächlich beschlossenen Gesetz gelandet sind, führten zu erheblichen Protesten und wurden auch von uns Bündnisgrünen damals scharf kritisiert. Mit dem vorliegenden Bericht wird nun nach einigen Jahren eine erste Einschätzung der praktischen Umsetzung möglich. Dass viele der besonders einschneidenden neuen Befugnisse sehr selten oder sogar gar nicht zum Tragen kamen, halte ich für ein gutes Zeichen." t2tt eng --src_lang deu
2023-09-26 19:40:12,287 INFO -- m4t_scripts.predict.predict: Running inference on the CPU in torch.float32.
Using the cached checkpoint of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached tokenizer of the model 'seamlessM4T_large'. Set `force=True` to download again.
Using the cached checkpoint of the model 'vocoder_36langs'. Set `force=True` to download again.
2023-09-26 19:40:35,856 INFO -- m4t_scripts.predict.predict: Translated text in eng: The Police Act was comprehensively reformed in 2019 by the then Red-Red Coalition. Especially under the impression of the dangers of terrorism, some intervention powers were moved very far in advance of an actual crime or even concrete planning. Some of the planned tightenings, not all of which then ended up in the actually adopted law, led to significant protests and were also sharply criticized by us Greens at the time.

MathiasSchindler, Sep 26 '23 17:09

Hello. I got the same issue with other language pairs. It looks like T2TT can translate only 3-4 sentences at once.

nedobylskiy, Oct 04 '23 12:10

Actually, both Seamless and its text-only predecessor NLLB were trained mostly on single-sentence translation, and they are by no means guaranteed to correctly translate multiple-sentence texts.

Thus, the safest recommendation is to split the input text into sentences, translate each sentence independently of the others, and concatenate the translations into the resulting text.

Here is the code that was used during the text preprocessing for Seamless/NLLB for splitting documents into sentences for various languages: https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/monolingual/utils/sentence_split.py
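As a rough illustration of the split, translate, and concatenate approach, here is a minimal Python sketch. It is not the stopes splitter linked above: `split_sentences` is a naive regex stand-in that will mis-split abbreviations and ordinals, and `translate_one` is a hypothetical callable that you would wire to whatever single-sentence translation call you use (none of these names come from seamless_communication).

```python
import re
from typing import Callable, List

# Naive boundary: whitespace that follows '.', '!' or '?'.
# Assumption: good enough for a demo; a language-aware splitter is preferable.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences with a simple punctuation heuristic."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def translate_by_sentence(text: str, translate_one: Callable[[str], str]) -> str:
    """Translate each sentence on its own and join the results with spaces.

    translate_one is a hypothetical hook: it takes one source sentence
    and returns its translation.
    """
    return " ".join(translate_one(s) for s in split_sentences(text))
```

For the paragraph in the original report, this would issue five separate single-sentence translations instead of one long one, which matches how the model was mostly trained.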

avidale, Oct 09 '23 13:10