NeMo-Curator

Add support for parallel data curation

Open shuoyangd opened this issue 1 year ago • 0 comments

Description

This PR adds support for parallel data curation. Namely:

  • A new dataset class ParallelDataset that supports loading and writing parallel data in simple bitext format.
  • A new ScoreFilter subclass ParallelScoreFilter that allows application of existing monolingual filters on parallel data while maintaining the alignment of sentence/document pairs.
  • A new ScoreFilter subclass JointScoreFilter that allows implementation of filters that take both fields of a parallel sentence/document pair as input.
  • New heuristic filters: HistogramFilter and LengthRatioFilter.
  • A new model-based filter that uses quality estimation models: QualityEstimationFilter.
  • Support for two families of quality estimation models: comet and cometoid.
  • A tutorial for parallel data curation.
  • Tests accompanying new features.
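
To illustrate the two filtering styles listed above, here is a minimal, standalone sketch in plain Python. It does not use the NeMo Curator API: the helper names (`length_ratio`, `filter_both_sides`, `filter_jointly`) and the threshold of 3.0 are hypothetical, and the ratio definition (longer side over shorter side, counted in whitespace tokens) is an assumption rather than the definition used by LengthRatioFilter in this PR.

```python
from typing import Callable, List, Tuple

# A bitext pair: (source segment, target segment), kept together so alignment is never lost.
Pair = Tuple[str, str]


def length_ratio(src: str, tgt: str) -> float:
    """Joint score: ratio of the longer side to the shorter side, in whitespace tokens.

    Assumption: the real LengthRatioFilter may define the ratio differently.
    """
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if min(n_src, n_tgt) == 0:
        return float("inf")  # an empty side can never pass a finite threshold
    return max(n_src, n_tgt) / min(n_src, n_tgt)


def filter_both_sides(pairs: List[Pair], mono_keep: Callable[[str], bool]) -> List[Pair]:
    """Apply a monolingual predicate to each side; drop the whole pair if either side fails."""
    return [(s, t) for s, t in pairs if mono_keep(s) and mono_keep(t)]


def filter_jointly(
    pairs: List[Pair],
    score_fn: Callable[[str, str], float],
    keep_fn: Callable[[float], bool],
) -> List[Pair]:
    """Score each pair using both fields at once, then keep pairs whose score passes."""
    return [(s, t) for s, t in pairs if keep_fn(score_fn(s, t))]


pairs = [
    ("Hello world .", "Hallo Welt ."),
    ("Yes", "Ja , das ist eine sehr lange und falsche Übersetzung"),
    ("", "Leere Quelle"),
]

# Monolingual filter on both sides: removes the pair with an empty source.
non_empty = filter_both_sides(pairs, lambda text: bool(text.strip()))

# Joint filter: removes the pair whose lengths differ by more than a factor of 3.
kept = filter_jointly(non_empty, length_ratio, lambda ratio: ratio <= 3.0)
print(kept)
```

The point of both helpers is that a pair is kept or dropped as a unit, so source and target never fall out of alignment, whether the decision comes from a per-side check or from a score computed over both sides.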

Joint work at MTMA 2024 with @nverma1.

Usage

See tutorials/bitext_cleaning/main.py.

Checklist

  • [x] I am familiar with the Contributing Guide.
  • [x] New or Existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

shuoyangd, Aug 08 '24 21:08