NeMo-Curator
Add a way to pass expected language to FastTextLangId filter
Description
Currently, the FastTextLangId filter only supports filtering by a minimum language ID score. Sometimes, however, we know which language the data is supposed to be in, and it would be useful to keep only the data that matches the expected language.
We use two-letter ISO 639-1 codes to denote languages.
Usage
Passing an extra argument when initializing the filter makes it check documents against the expected language. For example:
FastTextLangId(model_path=FAST_TEXT_MODEL_DIR, lang=SRC_LANG)
If the lang argument is not passed, the filter falls back to the old behavior of filtering by minimum language ID score.
The bitext_filtering tutorial is updated to demonstrate how this is used in a pipeline.
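As a rough illustration of the behavior described in this section, here is a minimal, self-contained sketch. It is not the actual FastTextLangId code: the class `LangIdFilterSketch`, the `predict` callable, the `fake_predict` helper, and the 0.3 cutoff are all hypothetical names and values chosen for the example. The sketch also assumes the minimum-score check still applies when an expected language is given, which this description does not state explicitly.

```python
from typing import Callable, Optional, Tuple

# Hypothetical stand-in for a language-ID model: text -> (language code, confidence).
Predictor = Callable[[str], Tuple[str, float]]


class LangIdFilterSketch:
    """Illustrative only; not the real FastTextLangId implementation."""

    def __init__(
        self,
        predict: Predictor,
        min_langid_score: float = 0.3,  # arbitrary cutoff for this sketch
        lang: Optional[str] = None,     # expected two-letter language code, if known
    ):
        self._predict = predict
        self._cutoff = min_langid_score
        # Normalize case so "EN" and "en" compare equal.
        self._lang = lang.lower() if lang is not None else None

    def score_document(self, text: str) -> Tuple[float, str]:
        code, score = self._predict(text)
        return score, code.lower()

    def keep_document(self, result: Tuple[float, str]) -> bool:
        score, code = result
        # New behavior: reject documents whose predicted language differs
        # from the expected one.
        if self._lang is not None and code != self._lang:
            return False
        # Old behavior (assumed to remain in effect): minimum-score filtering.
        return score >= self._cutoff


def fake_predict(text: str) -> Tuple[str, float]:
    """Toy predictor used only to exercise the sketch."""
    return ("de", 0.9) if "Hallo" in text else ("en", 0.9)


# Without lang: only the score matters, so both documents are kept.
score_only = LangIdFilterSketch(fake_predict)
# With lang="en": the German document is dropped despite its high score.
english_only = LangIdFilterSketch(fake_predict, lang="en")

for doc in ["Hello world", "Hallo Welt"]:
    print(doc, score_only.keep_document(score_only.score_document(doc)),
          english_only.keep_document(english_only.score_document(doc)))
```

Injecting the predictor as a callable keeps the sketch runnable without a fastText model file, and the same approach would let a test substitute a fake model.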
Checklist
- [x] I am familiar with the Contributing Guide.
- [ ] New or Existing tests cover these changes.
- [x] The documentation is up to date with these changes.
(The FastTextLangId filter is currently only tested with a fake emulator class. Not sure how best to cover this change with a test.)