NeMo-Curator
Create FastText classifier module
This PR adds a generic FastTextClassifier class and a DCLMFastTextClassifier class, which uses the https://huggingface.co/mlfoundations/fasttext-oh-eli5 model.
Implementation from: https://github.com/NVIDIA/NeMo-Curator/pull/536.
The Nemotron-CC classifiers from https://github.com/NVIDIA/NeMo-Curator/pull/518 were used in an ensemble with the DCLM FastText classifier, which is why I have created the module in this PR.
Unfortunately, this module is CPU-only, so I am keeping it as a draft for now. Ideally, we can accelerate it with CrossFit and include it in our suite of DistributedDataClassifier models. If not, it could still be nice to have as an alternative to the FastTextQualityFilter, which automatically filters by quality.