firefox-translations-training

Ensure monolingual corpus is de-duplicated from the parallel corpus

Open gregtatum opened this issue 1 year ago • 0 comments

I haven't fully audited the code, but I suspect that the monolingual data is not being deduplicated from the parallel data.

For instance, in the ca-en model, OpenSubtitles was used in the parallel corpus, and it was also included in the monolingual corpus via "Catalan Textual Corpus". Since the parallel corpus is deduplicated on the pair of source and target sentences, a synthesized translation of a sentence that already exists in the parallel data would likely differ from the original target, so the synthesized pair would be kept.

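It would be most cost-effective to deduplicate the monolingual data against the parallel data before it is translated, since that avoids spending compute on sentences we already have translations for. I don't think the dedupe utility supports filtering one corpus against another.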
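As a rough illustration of the idea (not the pipeline's actual code), here is a minimal sketch that filters monolingual lines against one side of the parallel corpus before translation. The function names `normalize`, `line_key`, `build_seen_set`, and `filter_monolingual` are hypothetical, and the whitespace and lowercase normalization is an assumption about what should count as a duplicate.

```python
import hashlib
from typing import Iterable, Iterator, Set


def normalize(line: str) -> str:
    # Assumed matching rule: collapse whitespace and ignore case.
    return " ".join(line.split()).lower()


def line_key(line: str) -> bytes:
    # Store a short hash instead of the full sentence to keep memory low.
    return hashlib.blake2b(normalize(line).encode("utf-8"), digest_size=8).digest()


def build_seen_set(parallel_side: Iterable[str]) -> Set[bytes]:
    # Keys for the side of the parallel corpus in the monolingual language.
    return {line_key(line) for line in parallel_side}


def filter_monolingual(mono: Iterable[str], parallel_keys: Set[bytes]) -> Iterator[str]:
    # Work on a copy so the caller's set is not modified.
    seen = set(parallel_keys)
    for line in mono:
        key = line_key(line)
        if key in seen:
            continue  # already in the parallel data, or repeated in mono
        seen.add(key)
        yield line


parallel_ca = ["Bon dia.", "Com estàs?"]
mono_ca = ["bon  dia.", "Fins aviat.", "Fins aviat.", "Adéu."]
kept = list(filter_monolingual(mono_ca, build_seen_set(parallel_ca)))
```

Here `kept` contains only the sentences that are absent from the parallel data, with repeats inside the monolingual data dropped as well. An 8-byte hash can in principle collide, so a real implementation might prefer a longer digest or an exact set.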

If we already do this, it would be good to have a test that covers it.
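Such a test might look roughly like the sketch below. The helper `find_overlap` and the inline fixture data are hypothetical, and exact matching after stripping whitespace is an assumption; a real test would read the pipeline's output artifacts instead.

```python
from typing import Iterable, List


def find_overlap(mono_lines: Iterable[str], parallel_lines: Iterable[str]) -> List[str]:
    # Return monolingual sentences that also occur in the parallel corpus.
    parallel = {line.strip() for line in parallel_lines}
    return [line.strip() for line in mono_lines if line.strip() in parallel]


def test_monolingual_is_disjoint_from_parallel() -> None:
    # Hypothetical stand-ins for the deduplicated pipeline outputs.
    parallel_src = ["Bon dia.\n", "Com estàs?\n"]
    mono_after_dedupe = ["Fins aviat.\n", "Adéu.\n"]
    assert find_overlap(mono_after_dedupe, parallel_src) == []
```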

gregtatum, Jan 24 '24 16:01