Create FastText classifier module

Open sarahyurick opened this issue 9 months ago • 0 comments

This PR creates a generic FastTextClassifier class and a DCLMFastTextClassifier class, which uses https://huggingface.co/mlfoundations/fasttext-oh-eli5.

Implementation from: https://github.com/NVIDIA/NeMo-Curator/pull/536.

The Nemotron-CC classifiers from https://github.com/NVIDIA/NeMo-Curator/pull/518 were used in an ensemble with the DCLM FastText classifier, which is why I have created the module in this PR.

Unfortunately, this module is CPU-only, so I am keeping it as a draft for now. Ideally, we can try to accelerate it with CrossFit and include it in our suite of DistributedDataClassifier models. If not, it could still be nice to have as an alternative to the FastTextQualityFilter, which automatically filters by quality.
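
To illustrate the distinction between scoring and filtering, here is a minimal sketch of how a classifier-style module might attach a quality score to each document instead of dropping documents. It is not the implementation from the PR. The names quality_scores and fake_predict are hypothetical, and the positive label "__label__hq" is an assumption about the model's label set. The predict callable is assumed to return (labels, probabilities) for a single line of text, which is the shape returned by fastText's model.predict.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# Assumed shape of a fastText-style prediction for one line of text:
# (labels, probabilities), aligned element by element.
PredictFn = Callable[[str], Tuple[Sequence[str], Sequence[float]]]


def quality_scores(
    predict: PredictFn,
    texts: Iterable[str],
    positive_label: str = "__label__hq",  # assumption: name of the high-quality label
) -> List[float]:
    """Return the probability of `positive_label` for each text (0.0 if absent)."""
    scores = []
    for text in texts:
        # fastText predicts one line at a time and rejects embedded newlines,
        # so collapse all whitespace runs (including newlines) to single spaces.
        line = " ".join(text.split())
        labels, probs = predict(line)
        score = 0.0
        for label, prob in zip(labels, probs):
            if label == positive_label:
                score = float(prob)
        scores.append(score)
    return scores


def fake_predict(line: str) -> Tuple[Sequence[str], Sequence[float]]:
    """Hypothetical stand-in for a loaded model so the sketch runs without fastText."""
    p_hq = 0.9 if "because" in line else 0.2
    pairs = [("__label__hq", p_hq), ("__label__cc", 1.0 - p_hq)]
    pairs.sort(key=lambda pair: pair[1], reverse=True)  # most probable label first
    return tuple(label for label, _ in pairs), tuple(prob for _, prob in pairs)


if __name__ == "__main__":
    docs = ["Ice floats\nbecause it is less dense than water.", "buy now"]
    # Every document is kept; the score is attached rather than used as a filter.
    for doc, score in zip(docs, quality_scores(fake_predict, docs)):
        print(round(score, 2), repr(doc))
```

With a real model, predict could be something like lambda line: model.predict(line, k=-1), where model comes from fasttext.load_model on a locally downloaded copy of the model file. Passing the callable in keeps the scoring logic independent of how the model is loaded.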

sarahyurick, Feb 13 '25 00:02