argos-translate

Use LASER to improve data quality

Open PJ-Finlay opened this issue 4 years ago • 2 comments

  • https://github.com/facebookresearch/LASER

PJ-Finlay avatar May 24 '21 22:05 PJ-Finlay

Creating a new issue to use LASER to improve data quality. LASER generates sentence embeddings that are semantically consistent across languages. This makes it possible to measure how similar the two sides of a parallel sentence pair are, so that bad pairs can be removed.

PJ-Finlay avatar May 24 '21 22:05 PJ-Finlay

I use SentenceTransformers to do the same thing with cosine similarity. It increases pre-processing time a lot (it is slow: even with an optimized cosine calculation it runs at about 1000 examples/s), but it provides a decent filter.

You would use the multilingual models which are listed here

Mentioned it here in the forum

ArtanisTheOne avatar Apr 28 '23 00:04 ArtanisTheOne
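
A minimal sketch of the filtering idea discussed in this thread, assuming the sentence embeddings have already been computed by a multilingual encoder (LASER or a SentenceTransformers model) and are passed in as plain lists of floats. The function names, the `threshold` value of 0.7, and the toy two-dimensional vectors are all hypothetical and chosen only for illustration; a real threshold would need to be tuned on the data.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_parallel_pairs(
    pairs: Sequence[Pair],
    src_vectors: Sequence[Vector],
    tgt_vectors: Sequence[Vector],
    threshold: float = 0.7,  # hypothetical cutoff, not a recommended value
) -> List[Pair]:
    """Keep only the pairs whose source and target embeddings are similar enough."""
    kept = []
    for pair, u, v in zip(pairs, src_vectors, tgt_vectors):
        if cosine_similarity(u, v) >= threshold:
            kept.append(pair)
    return kept


# Toy data: made-up 2-d vectors standing in for real sentence embeddings.
pairs = [("Hello", "Hola"), ("Good morning", "Precio: 5 EUR")]
src_vectors = [[1.0, 0.0], [0.0, 1.0]]
tgt_vectors = [[0.9, 0.1], [1.0, 0.0]]

print(filter_parallel_pairs(pairs, src_vectors, tgt_vectors))
# prints [('Hello', 'Hola')]
```

The pure-Python loop is only meant to show the logic. At the corpus sizes involved in training translation models, the similarity step would normally be batched and vectorized, and the throughput concern raised above would still apply to the encoding step itself.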