NeMo-Curator
Add support for parallel data curation
Description
This PR adds support for parallel data curation. Namely:
- A new dataset class `ParallelDataset` that supports loading and writing parallel data in simple bitext format.
- A new `ScoreFilter` subclass `ParallelScoreFilter` that allows application of existing monolingual filters on parallel data while maintaining the alignment of sentence/document pairs.
- A new `ScoreFilter` subclass `JointScoreFilter` that allows implementation of filters that take both fields of the parallel sentence/document pairs.
- New heuristic filters: `HistogramFilter` and `LengthRatioFilter`.
- New model-based filters that use quality estimation models: `QualityEstimationFilter`.
- Support for two families of quality estimation models: `comet` and `cometoid`.
- A tutorial for parallel data curation.
- Tests accompanying new features.
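
For readers new to bitext filtering, the two filtering patterns listed above can be sketched in plain Python. This is only an illustration of the idea, not the code in this PR: `read_bitext`, `parallel_filter`, `length_ratio_ok`, and `min_words` are hypothetical helpers, the whitespace tokenization and the `max_ratio=3.0` threshold are assumptions, and none of the NeMo Curator classes are used.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source text, target text)


def read_bitext(src_lines: Iterable[str], tgt_lines: Iterable[str]) -> List[Pair]:
    """Pair line i of the source with line i of the target (simple bitext)."""
    src = [line.rstrip("\n") for line in src_lines]
    tgt = [line.rstrip("\n") for line in tgt_lines]
    if len(src) != len(tgt):
        raise ValueError("source and target must have the same number of lines")
    return list(zip(src, tgt))


def parallel_filter(
    pairs: List[Pair],
    src_keep: Callable[[str], bool],
    tgt_keep: Callable[[str], bool],
) -> List[Pair]:
    """Apply a monolingual predicate to each side.

    A pair is dropped as a whole if either side fails, so the two sides
    never go out of alignment.
    """
    return [(s, t) for s, t in pairs if src_keep(s) and tgt_keep(t)]


def length_ratio_ok(pair: Pair, max_ratio: float = 3.0) -> bool:
    """Joint predicate: needs both fields of the pair at once."""
    s_len = len(pair[0].split())
    t_len = len(pair[1].split())
    if s_len == 0 or t_len == 0:
        return False  # an empty side cannot be a valid translation
    return max(s_len, t_len) / min(s_len, t_len) <= max_ratio


def min_words(n: int) -> Callable[[str], bool]:
    """Hypothetical monolingual filter: keep text with at least n tokens."""
    return lambda text: len(text.split()) >= n


src = ["Hello world .", "Yes", "This is a long sentence with many words in it ."]
tgt = ["Hallo Welt .", "Ja , genau so ist es", "Kurz ."]

pairs = read_bitext(src, tgt)
kept = parallel_filter(pairs, min_words(2), min_words(2))  # per-side filtering
kept = [p for p in kept if length_ratio_ok(p)]             # joint filtering
print(kept)
```

In this sketch the second pair is removed by the per-side filter (the source has a single token) and the third by the joint length-ratio check, which is the kind of distinction `ParallelScoreFilter` and `JointScoreFilter` are meant to capture.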
Joint work at MTMA 2024 with @nverma1.
Usage
See `tutorials/bitext_cleaning/main.py`.
Checklist
- [x] I am familiar with the Contributing Guide.
- [x] New or Existing tests cover these changes.
- [x] The documentation is up to date with these changes.