lindat-translation

Bad translation when text is in upper-case

Open jirib opened this issue 4 years ago • 7 comments

I often discover that text is incorrectly translated when it is upper-case. See examples:

TOMORROW IS A GENERAL STRIKE BECAUSE THE HEALTH OF WORKERS COMES BEFORE THE PROFITS OF THE BOSSES!
ZÍTRA JE VŠEOBECNÝ STRIKE PRO ZDRAVÍ PRACOVNÍKŮ PŘED PROFILY ČLENSKÝCH STÁTŮ!

vs

tomorrow is a general strike because the health of workers comes before the profits of the bosses!
zítra je generální stávka, protože zdraví zaměstnanců je přednější než zisky šéfů!

jirib avatar Jan 29 '21 08:01 jirib

Thanks for reporting. I confirm we are aware of this.

A simple workaround would be to handle this in the web interface (frontend): in pre-processing, lower-case input lines which are all-upper-case; in post-processing, upper-case the translation of such lines. Even better than lower-casing would be true-casing, but that would require having a true-caser for each language. I think lower-casing is OK (this is just a temporary workaround anyway). We will just need to turn this workaround on/off independently for each language pair (model). We can turn it on for all source languages except for German. Once a proper solution is ready (see below), we can turn it off for these new models. @kosarko, what do you think?
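The workaround described above could be sketched roughly as follows. This is only an illustration, not the actual Lindat frontend code: the `translate` callable, the function names, and the `CASE_WORKAROUND_DISABLED_SOURCES` switch are all hypothetical, and translating each line separately is a simplifying assumption.

```python
from typing import Callable

# Hypothetical per-language switch: source languages for which the
# workaround stays off (German, because its nouns are capitalized).
CASE_WORKAROUND_DISABLED_SOURCES = {"de"}


def is_all_upper(line: str) -> bool:
    # str.isupper() is True only when the line has at least one cased
    # character and every cased character is upper-case.
    return line.isupper()


def translate_with_case_workaround(
    text: str, src: str, translate: Callable[[str], str]
) -> str:
    """Lower-case all-upper-case lines before translation and
    upper-case their translations afterwards."""
    if src in CASE_WORKAROUND_DISABLED_SOURCES:
        return translate(text)
    out = []
    for line in text.split("\n"):
        if is_all_upper(line):
            # Pre-process: lower-case; post-process: upper-case.
            out.append(translate(line.lower()).upper())
        else:
            out.append(translate(line))
    return "\n".join(out)
```

Here `translate` stands in for whatever call the frontend makes to the translation backend; mixed-case lines are passed through unchanged.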

I plan a more general solution: train a robust translation model which would correctly handle cases where just a few words are upper-cased (but not the whole sentence), but at the same time distinguish cases where the casing is important, e.g. "Call us" vs "Call US" or "Přišel pes" vs "Přišel PES". However, this will take several months and I won't have time to re-train all the language pairs currently available at Lindat. So the above-mentioned workaround will still be useful.

martinpopel avatar Jan 29 '21 09:01 martinpopel

Another example. I was quite scared when reading the translation. The translation to English worked fine.

Verkaufe hier oben genannte Hauptelektronik aus einem Bauknecht Herd. Funktioniert einwandfrei. 
Passt in viele verschiedene Geräte. BITTE NUMMER ABGLEICHEN!
Prodej hlavní elektroniku tady nahoře ze staveniště Herd. Funguje to dobře. 
Hodí se k různým přístrojům. ZABIJTE SE ZABIJETE! ZABIJTE SE! ZABIJTE SE!

naro avatar Feb 20 '21 19:02 naro

This is an interesting example. If I remember correctly, we currently do not have a direct German-Czech model, so I am guessing that the text was translated using English as a pivot (not sure about the details though).

When I tried using the webservice, the pivotal translation looked good (I pivoted manually translating from German-to-English and then pasting the translation to English-to-Czech): "BITTE NUMMER ABGLEICHEN!" -> "PLEASE COMPARE NUMBER!" -> "PROSÍM POČET!"

(The translation using German-to-Czech was producing the same mistranslation as in the above post.)
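For illustration, the manual pivoting described above amounts to chaining two translation steps. In the sketch below, the lookup tables only replay the outputs quoted in this comment and stand in for the real German-to-English and English-to-Czech models; `pivot_translate`, `de_to_en`, and `en_to_cs` are hypothetical names, not part of the service.

```python
from typing import Callable

Translator = Callable[[str], str]


def pivot_translate(
    text: str, src_to_pivot: Translator, pivot_to_tgt: Translator
) -> str:
    """Translate via an intermediate (pivot) language."""
    return pivot_to_tgt(src_to_pivot(text))


# Stand-ins for the real models: they only replay the outputs quoted above.
DE_EN = {"BITTE NUMMER ABGLEICHEN!": "PLEASE COMPARE NUMBER!"}
EN_CS = {"PLEASE COMPARE NUMBER!": "PROSÍM POČET!"}


def de_to_en(text: str) -> str:
    return DE_EN[text]


def en_to_cs(text: str) -> str:
    return EN_CS[text]


print(pivot_translate("BITTE NUMMER ABGLEICHEN!", de_to_en, en_to_cs))
# prints: PROSÍM POČET!
```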

Are we, by any chance, using a different En-Cs model in the pivotal German-to-Czech translation than in the direct English-to-Czech translation?

varisd avatar Feb 22 '21 13:02 varisd

First, thanks @naro for reporting. While we are aware of the problem in general, it is good to know that there are also such disturbing mistranslations.

@varisd: As noted at the bottom of https://lindat.mff.cuni.cz/services/translation/, there are direct cs<->de models by @Gldkslfmsd. Maybe we should just disable those, and use pivoting via English. Someone should evaluate both ways together with Google Translate on some test set. I receive many complaints about the German translation quality at lindat.cz - maybe we should exclude German completely if it is of a lower quality than Google.

In my comment above, I suggested a simple workaround for all-upper-case sentences. Unfortunately, it does not work for German, where all nouns start with a capital letter: instead of lower-casing the sentence, we would need to true-case it, so we would need to train a true-caser.

martinpopel avatar Feb 22 '21 13:02 martinpopel

What about simple conditional lowercasing, if the sentence is all uppercase?

BITTE NUMMER ABGLEICHEN! => Bitte nummer abgleichen! => Porovnejte to s číslem, prosím.
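A minimal sketch of that conditional lowercasing, assuming the sentence should keep an initial capital as in the example above (the function name `conditional_lowercase` is made up for illustration):

```python
def conditional_lowercase(sentence: str) -> str:
    """Sentence-case the input only if all its cased characters are upper-case."""
    if sentence.isupper():
        # capitalize() upper-cases the first character and lower-cases the rest.
        return sentence.capitalize()
    return sentence


print(conditional_lowercase("BITTE NUMMER ABGLEICHEN!"))
# prints: Bitte nummer abgleichen!
```

Note that this also lower-cases German nouns ("nummer"), so the model sees input that is not correctly cased German.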

Gldkslfmsd avatar Feb 22 '21 14:02 Gldkslfmsd

@Gldkslfmsd This is what I suggested above as the workaround, but it would not work for German if the sentence contains nouns.

martinpopel avatar Feb 22 '21 14:02 martinpopel

Have you tried it? Of course it won't be optimal, but it could be better.

Gldkslfmsd avatar Feb 22 '21 14:02 Gldkslfmsd