Wikipedia [ modifier | modifier le code ] grossly confusing fr-en model

Open kpu opened this issue 3 years ago • 4 comments

https://fr.wikipedia.org/wiki/Droupadi_Murmu is one example, but this is all over French Wikipedia. Did the data cleaning remove |?

Source has [ modifier | modifier le code ]: src

Target has a very confused model: tgt

kpu avatar Jul 30 '22 11:07 kpu

This is an issue with ParaCrawl, where one source sentence is often aligned to about 20 different sentences on the target side: one of them is the correct translation, and the rest are metadata like what you see here. We ran fr ParaCrawl through the deduplicator on both sides, and apparently it just kept the first src-trg pair it saw, which often turned out to be a genuine source sentence aligned to crap metadata on the target side. Hence the result.

XapaJIaMnu avatar Jul 30 '22 11:07 XapaJIaMnu

The problem is so prevalent that dedup threw away 35% of fr-en ParaCrawl.

XapaJIaMnu avatar Jul 30 '22 11:07 XapaJIaMnu

The deduper is designed to keep the first input and remove the subsequent ones. I think you want the one with the highest bicleaner score?

kpu avatar Jul 30 '22 11:07 kpu
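
To make the difference concrete, here is a minimal sketch contrasting first-seen deduplication with keeping the highest-scoring pair per source sentence. It is not the actual deduper or the bicleaner API: the `Pair` type, both function names, the toy sentences, and the scores are all invented for illustration, and it deduplicates on the source side only.

```python
from typing import Iterable, List, NamedTuple


class Pair(NamedTuple):
    src: str
    tgt: str
    score: float  # hypothetical per-pair quality score (e.g. a bicleaner-style score)


def dedup_keep_first(pairs: Iterable[Pair]) -> List[Pair]:
    """Keep the first pair seen for each source sentence (first-seen behaviour)."""
    kept = {}
    for pair in pairs:
        kept.setdefault(pair.src, pair)  # later duplicates are ignored
    return list(kept.values())


def dedup_keep_best(pairs: Iterable[Pair]) -> List[Pair]:
    """Keep the highest-scoring pair for each source sentence."""
    best = {}
    for pair in pairs:
        current = best.get(pair.src)
        if current is None or pair.score > current.score:
            best[pair.src] = pair
    return list(best.values())


# Invented toy data: one source sentence aligned to metadata first,
# then to its real translation. The scores are made up.
pairs = [
    Pair("Biographie", "[ edit | edit source ]", 0.12),
    Pair("Biographie", "Biography", 0.93),
    Pair("Biographie", "Biography", 0.93),
]

first = dedup_keep_first(pairs)
best = dedup_keep_best(pairs)
print(first[0].tgt)
print(best[0].tgt)
```

With this toy input, first-seen dedup keeps the metadata alignment simply because it came first, while score-based dedup keeps the real translation. How well this works in practice depends on the scores actually ranking metadata below real translations.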

Yes, but we were in a pinch ;/. Or translate and compute BLEU scores against the synthetic translation. Or anything else but what we did.

Tbh I didn't expect the model to remember those cases so well...

XapaJIaMnu avatar Jul 30 '22 12:07 XapaJIaMnu