Wikipedia [ modifier | modifier le code ] grossly confusing fr-en model
https://fr.wikipedia.org/wiki/Droupadi_Murmu is one example, but this is all over French Wikipedia. Did the data cleaning remove the | character?
Source has [ modifier | modifier le code ]:

Target shows a very confused model:

This is an issue with ParaCrawl, where one sentence on the source side is often aligned to about 20 different sentences on the target side. One of them is the correct translation and the rest are metadata like what you see here. We ran the fr ParaCrawl through the deduplicator on both sides, and apparently it just kept the first src-trg pair it saw, which often turned out to be a genuine source sentence aligned to junk metadata on the target side, and there you have the result.
The problem is so prevalent that dedup threw away 35% of the fr-en ParaCrawl.
The deduper is designed to keep the first input and remove the subsequent ones. I think you want the one with the highest Bicleaner score?
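
To make the difference concrete, here is a rough sketch of keep-first versus keep-best deduplication. It is not the actual deduper code: the `Pair` type, the `score` field (standing in for a Bicleaner score), both function names, and the toy data are made up for illustration, and "dedup on both sides" is assumed to mean dropping a pair whenever its source or its target has already been seen.

```python
from typing import Iterable, List, NamedTuple


class Pair(NamedTuple):
    src: str
    trg: str
    score: float  # hypothetical per-pair quality score, e.g. from Bicleaner


def dedup_keep_first(pairs: Iterable[Pair]) -> List[Pair]:
    """Keep a pair only if neither its source nor its target was seen before."""
    seen_src, seen_trg = set(), set()
    kept = []
    for p in pairs:
        if p.src in seen_src or p.trg in seen_trg:
            continue  # duplicate on at least one side: drop it
        seen_src.add(p.src)
        seen_trg.add(p.trg)
        kept.append(p)
    return kept


def dedup_keep_best(pairs: Iterable[Pair]) -> List[Pair]:
    """Same dedup, but visit pairs from highest to lowest score first."""
    ranked = sorted(pairs, key=lambda p: p.score, reverse=True)
    return dedup_keep_first(ranked)


# Toy data (invented): one source sentence aligned to one real
# translation and two pieces of page metadata.
pairs = [
    Pair("Elle est née en 1958.", "[ edit | edit source ]", 0.05),
    Pair("Elle est née en 1958.", "She was born in 1958.", 0.93),
    Pair("Elle est née en 1958.", "Retrieved from Wikipedia", 0.02),
]

first = dedup_keep_first(pairs)
best = dedup_keep_best(pairs)
```

With this toy input, keep-first retains the metadata line because it happens to come first, while keep-best retains the real translation. The catch is that sorting needs the whole corpus scored and in memory (or an external sort), which is more work than a streaming first-seen pass.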
Yes, but we were in a pinch ;/. Or translate and compute BLEU scores against the synthetic translation. Or anything else but what we did.
Tbh I didn't expect the model to remember those cases so well...