NeMo-Curator
Add option to skip data by adding a flag instead of removing them
Description
This PR adds an option to mark filtered entries with a skip label in the JSON/Parquet outputs instead of removing them entirely. When the option is enabled, it also records which filter discarded an entry by writing that filter's class name to a field ("reason" by default).
This allows easy tracking/book-keeping in some scenarios, for example:
- When there is another modality (e.g. speech) and the data files are not self-contained
- When someone is experimenting with new filters and needs to know how many entries each filter throws out
- When someone is running other pipelines outside NeMo-Curator
The feature can be applied to all filters without extra code change.
Although all entries are preserved in the dataset, when filters are chained in the form g(f(x)), g is still only run on entries that were not filtered out by f.
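To make the chaining behaviour concrete, here is a minimal standalone sketch in plain Python. It is not the code in this PR: the `apply_filter` helper, the `_skipme` field name, and the `NonEmptyFilter` label are hypothetical and only illustrate the idea; "reason" is the default field name described above.

```python
from typing import Callable, Dict, List

SKIP_FIELD = "_skipme"   # hypothetical name for the skip label
REASON_FIELD = "reason"  # default reason field described in this PR


def apply_filter(
    rows: List[Dict], name: str, keep: Callable[[Dict], bool]
) -> List[Dict]:
    """Label rows rejected by `keep` instead of dropping them."""
    for row in rows:
        if row.get(SKIP_FIELD):
            continue  # already rejected by an earlier filter: do not evaluate
        if not keep(row):
            row[SKIP_FIELD] = True
            row[REASON_FIELD] = name
    return rows


def non_empty(row: Dict) -> bool:
    return bool(row["src"]) and bool(row["tgt"])


def length_ratio_ok(row: Dict) -> bool:
    # Would divide by zero on an empty side, so it relies on never
    # being called for rows that non_empty already rejected.
    a, b = len(row["src"]), len(row["tgt"])
    return max(a, b) / min(a, b) <= 2


rows = [
    {"src": "hello", "tgt": "bonjour"},
    {"src": "hi", "tgt": "a very long translation indeed"},
    {"src": "", "tgt": "vide"},
]

# g(f(x)): f = non-empty check, g = length-ratio check
rows = apply_filter(rows, "NonEmptyFilter", non_empty)
rows = apply_filter(rows, "LengthRatioFilter", length_ratio_ok)
```

After both passes all three rows are still present; the second carries the reason "LengthRatioFilter" and the third carries "NonEmptyFilter", and the length-ratio check is never evaluated on the third row.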
Usage
Simply add the extra flag add_skip_label_only=True to any filter definition. For example:
LengthRatioFilter(
    max_ratio=2,
    src_lang=SRC_LANG,
    tgt_lang=TGT_LANG,
    score_field="length_ratio",
    score_type=float,
    add_skip_label_only=True,
)
This feature does not work with the plain bitext output format, because extra fields cannot be added there. Make sure the JSON or Parquet format is used.
A working example is in the updated tutorials/bitext_cleaning.
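For the book-keeping use case, a downstream script could tally how many entries each filter discarded. The sketch below is an illustration under assumptions, not part of this PR: it assumes the JSON output has one object per line and that kept entries have no "reason" value; the `count_discards` helper and the sample records are made up.

```python
import json
from collections import Counter
from typing import Iterable


def count_discards(lines: Iterable[str], reason_field: str = "reason") -> Counter:
    """Count labeled entries per filter class name in JSON-lines text."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        reason = record.get(reason_field)
        if reason is not None:
            counts[reason] += 1
    return counts


# Made-up sample records standing in for an output file.
sample_lines = [
    '{"src": "hello", "tgt": "bonjour"}',
    '{"src": "hi", "tgt": "a much longer target", "reason": "LengthRatioFilter"}',
    '{"src": "ok", "tgt": "another very long target", "reason": "LengthRatioFilter"}',
]

discards = count_discards(sample_lines)
```

With a real output file, the same function can be called on the open file handle, since iterating a text file yields its lines.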
Checklist
- [x] I am familiar with the Contributing Guide.
- [ ] New or Existing tests cover these changes.
- [ ] The documentation is up to date with these changes.
(Consider this as an initial draft. I'll write tests and docs if this is deemed merge-worthy.)