Add features from Llama Nemotron tutorial to NeMo Curator modules

Open · sarahyurick opened this issue 6 months ago • 1 comment

While some of the classes in https://github.com/NVIDIA/NeMo-Curator/pull/695 are very specific to the dataset being curated, others could be useful for a variety of datasets and should be added as NeMo Curator modules.

In particular, I think it would be nice to add these as modules: NonEnglishFilter (could be generalized to any language), TokenCountFilter, and CompletionTokenCountFilter.
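To make the request concrete, here is a rough sketch of what a dataset-agnostic TokenCountFilter might look like. It is not the code from the tutorial PR and it does not import NeMo Curator: the class is standalone, the score/keep method split is an assumed shape for a filter, and the whitespace tokenizer is only a stand-in for whatever real tokenizer the caller supplies.

```python
from typing import Callable, List, Optional


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): splits on whitespace only."""
    return text.split()


class TokenCountFilter:
    """Hypothetical sketch: keep documents whose token count is in a range.

    The tokenizer is injected so the same filter could work with any
    tokenizer; nothing here is tied to a specific dataset.
    """

    def __init__(
        self,
        tokenize: Callable[[str], List[str]] = whitespace_tokenize,
        min_tokens: int = 0,
        max_tokens: Optional[int] = None,
    ) -> None:
        if max_tokens is not None and max_tokens < min_tokens:
            raise ValueError("max_tokens must be >= min_tokens")
        self._tokenize = tokenize
        self._min_tokens = min_tokens
        self._max_tokens = max_tokens

    def score_document(self, text: str) -> int:
        # The score is simply the number of tokens in the document.
        return len(self._tokenize(text))

    def keep_document(self, score: int) -> bool:
        # Keep the document if the count is within [min_tokens, max_tokens].
        if score < self._min_tokens:
            return False
        return self._max_tokens is None or score <= self._max_tokens


if __name__ == "__main__":
    token_filter = TokenCountFilter(min_tokens=2, max_tokens=5)
    docs = ["hi", "a short example document", "one two three four five six"]
    kept = [d for d in docs if token_filter.keep_document(token_filter.score_document(d))]
    print(kept)
```

A CompletionTokenCountFilter could presumably reuse the same logic and differ only in which text field it scores.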

sarahyurick, May 12 '25 17:05