NeMo-Curator
Add features from Llama Nemotron tutorial to NeMo Curator modules
While some of the classes in https://github.com/NVIDIA/NeMo-Curator/pull/695 are very specific to the dataset being curated, others could be useful for a variety of datasets and should be added as NeMo Curator modules.
In particular, I think it would be nice to add these as modules: NonEnglishFilter (could be generalized to any language), TokenCountFilter, and CompletionTokenCountFilter.
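To make the proposal concrete, here is a rough sketch of what generalized versions of these filters might look like. It is standalone Python that loosely follows a score-then-keep pattern and does not import NeMo Curator or reproduce the code in the PR. The class name `LanguageFilter`, the `detect_language` callable, the whitespace tokenizer, the toy detector, and the `completion` field name are all assumptions made for illustration; a real module would presumably plug in a model tokenizer and a proper language-identification model.

```python
from typing import Callable, Iterable, List


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption); a real module would likely use a model tokenizer."""
    return text.split()


class TokenCountFilter:
    """Hypothetical sketch: keep documents whose token count lies in [min_tokens, max_tokens]."""

    def __init__(
        self,
        tokenizer: Callable[[str], List[str]] = whitespace_tokenize,
        min_tokens: int = 0,
        max_tokens: float = float("inf"),
    ) -> None:
        self._tokenizer = tokenizer
        self._min_tokens = min_tokens
        self._max_tokens = max_tokens

    def score_document(self, text: str) -> int:
        # The score is simply the number of tokens in the text.
        return len(self._tokenizer(text))

    def keep_document(self, score: int) -> bool:
        return self._min_tokens <= score <= self._max_tokens


class LanguageFilter:
    """Hypothetical generalization of NonEnglishFilter: keep documents in the allowed languages."""

    def __init__(self, detect_language: Callable[[str], str], languages: Iterable[str]) -> None:
        # detect_language is injected so that any language-ID model can be used.
        self._detect_language = detect_language
        self._languages = set(languages)

    def score_document(self, text: str) -> str:
        # The score is the detected language code.
        return self._detect_language(text)

    def keep_document(self, score: str) -> bool:
        return score in self._languages


def toy_detect(text: str) -> str:
    """Toy detector for the demo only: pure-ASCII text is labeled 'en', anything else 'other'."""
    return "en" if text.isascii() else "other"


# Usage: a completion-length filter is the same token-count logic applied to a
# different field ("completion" is an assumed field name).
record = {"prompt": "Say hi", "completion": "Hello there, nice to meet you"}
completion_filter = TokenCountFilter(max_tokens=4)
keep_completion = completion_filter.keep_document(
    completion_filter.score_document(record["completion"])
)

english_only = LanguageFilter(toy_detect, languages={"en"})
keep_prompt = english_only.keep_document(english_only.score_document(record["prompt"]))
```

Injecting the tokenizer and the language detector keeps the filters independent of any particular model, which is the main thing that would make them reusable across datasets. In this sketch CompletionTokenCountFilter reduces to TokenCountFilter applied to another text field, so it may not need to be a separate class.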