NeMo-Curator
Scalable data preprocessing and curation toolkit for LLMs
Closes https://github.com/NVIDIA/NeMo-Curator/issues/195.
Closes https://github.com/NVIDIA/NeMo-Curator/issues/342.
## Description This PR adds an option to attach skip labels to filtered entries in the JSON/Parquet outputs instead of removing them entirely. When this feature is enabled, it...
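The idea of labeling filtered entries rather than dropping them can be illustrated with a minimal, library-free sketch. It is not the PR's implementation: the function `label_instead_of_drop`, the field name `skip_label`, and the filter name `word_count` are hypothetical names chosen for this illustration only.

```python
import json


def label_instead_of_drop(records, keep_fn, skip_field="skip_label", filter_name="word_count"):
    """Return every record, tagging those that fail keep_fn instead of removing them.

    All names here are hypothetical; NeMo-Curator's actual field names may differ.
    """
    labeled = []
    for rec in records:
        out = dict(rec)  # copy so the input records are left untouched
        # An empty label means the record passed; otherwise record which filter rejected it.
        out[skip_field] = "" if keep_fn(rec) else filter_name
        labeled.append(out)
    return labeled


docs = [
    {"id": 1, "text": "a short doc"},
    {"id": 2, "text": "this document has comfortably more than five words in it"},
]

# Keep documents with at least five whitespace-separated words.
labeled = label_instead_of_drop(docs, lambda r: len(r["text"].split()) >= 5)

# Serialize as JSON Lines: both records survive, one carries a skip label.
jsonl = "\n".join(json.dumps(r) for r in labeled)
print(jsonl)
```

Downstream consumers can then choose whether to honor the label, which keeps the filtering decision auditable and reversible.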
Creates a generic `FastTextClassifier` class and a `DCLMFastTextClassifier` class, which uses https://huggingface.co/mlfoundations/fasttext-oh-eli5. Implementation from: https://github.com/NVIDIA/NeMo-Curator/pull/536. The Nemotron-CC classifiers from https://github.com/NVIDIA/NeMo-Curator/pull/518 were used in an ensemble with the DCLM FastText classifier,...
Bumps [nltk](https://github.com/nltk/nltk) from 3.8.1 to 3.9. Changelog (sourced from nltk's changelog): Version 3.9.1 (2024-08-19): Fixed bug that prevented wordnet from loading. Version 3.9 (2024-08-18): Fix security vulnerability CVE-2024-39705 (breaking change)...
Please check Issue https://github.com/NVIDIA/NeMo-Curator/issues/411
Import the `download_common_crawl` function from `nemo_curator`. ## Description The Quick Example was missing the import statement, so this PR adds it.
## Description This PR does a couple of things: - Refactor `ScoreFilter` and `DocumentFilter` to be a single module. - Refactor `Modify` and `DocumentModifier` to be a single module. -...