NeMo-Curator

Add a way to pass expected language to FastTextLangId filter

Open shuoyangd opened this issue 9 months ago • 0 comments

Description

Currently, the FastTextLangId filter only supports filtering by a minimum language ID score. Sometimes, however, we know which language the data is supposed to be in, and it would be a useful addition to keep only the data that matches the expected language.

We use two-letter ISO 639-1 codes to denote languages.

Usage

Passing an extra lang argument when initializing the filter makes it check each document against the expected language, for example:

FastTextLangId(model_path=FAST_TEXT_MODEL_DIR, lang=SRC_LANG)

If the lang argument is not passed, the filter falls back to the old behavior of filtering by minimum language ID score.

The bitext_filtering tutorial is updated to demonstrate how this is used in a pipeline.
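To make the two modes concrete, here is a minimal sketch of the keep/discard decision. It is not the NeMo-Curator implementation: `keep_document`, its parameters, and the choice to skip the score threshold when an expected language is given are assumptions made for illustration. The only external fact relied on is that fastText returns labels prefixed with `__label__`.

```python
from typing import Optional

# fastText prefixes every predicted label with this string, e.g. "__label__en".
LABEL_PREFIX = "__label__"


def keep_document(
    label: str,
    score: float,
    min_score: float,
    expected_lang: Optional[str] = None,
) -> bool:
    """Hypothetical helper illustrating the two filtering modes.

    label:         top fastText prediction, e.g. "__label__en"
    score:         confidence of that prediction, in [0, 1]
    min_score:     threshold used by the score-only (old) behavior
    expected_lang: two-letter language code, or None for the old behavior
    """
    # Strip the fastText prefix so the label can be compared to a plain code.
    predicted = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label

    if expected_lang is None:
        # Old behavior: keep anything whose language ID score is high enough.
        return score >= min_score

    # Assumed new behavior: keep only documents whose predicted language
    # matches the expected one (case-insensitive comparison).
    return predicted.lower() == expected_lang.lower()


# Score-only mode keeps a confident prediction regardless of language.
print(keep_document("__label__de", 0.9, min_score=0.3))
# Expected-language mode rejects the same document when English is expected.
print(keep_document("__label__de", 0.9, min_score=0.3, expected_lang="en"))
```

Whether the real filter still applies the minimum score when lang is set is not stated here, so the sketch treats the two checks as alternatives.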

Checklist

  • [x] I am familiar with the Contributing Guide.
  • [ ] New or Existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

(The FastTextLangId filter is currently only tested with a fake emulator class. I am not sure how best to cover this change with tests.)

shuoyangd, Feb 21 '25 22:02