NeMo-Curator
Scalable data preprocessing and curation toolkit for LLMs
While some of the classes in https://github.com/NVIDIA/NeMo-Curator/pull/695 are very specific to the dataset being curated, others could be useful for a variety of datasets and should be added as NeMo...
Currently, a user wanting to implement non-English multilingual PII redaction must edit our source code. This is not ideal. The solution should be pretty simple, see: https://github.com/NVIDIA/NeMo-Curator/blob/c909eb3ec43aba0472e655cc8ab473a48543ea11/docs/user-guide/personalidentifiableinformationidentificationandremoval.rst#multilingual-pii-redaction
**Is your feature request related to a problem? Please describe.** We have a few variables in _compat.py (possibly codepaths too that use these variables) that are for older version of...
Loosely modeled after the NeMo setup:
- https://github.com/NVIDIA/NeMo/tree/main/tests/functional_tests
- https://github.com/NVIDIA/NeMo/blob/main/.github/workflows/cicd-main-e2e-tests.yml

TODO:
- [x] [aegis_classifier_inference](https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/classifiers/aegis_classifier_inference.py)
- [x] [content_type_classifier_inference](https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/classifiers/content_type_classifier_inference.py)
- [x] [domain_classifier_inference](https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/classifiers/domain_classifier_inference.py)
- [x] [fineweb_edu_classifier_inference](https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/classifiers/fineweb_edu_classifier_inference.py)
- [x] [fineweb_mixtral_edu_classifier_inference](https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/classifiers/fineweb_mixtral_edu_classifier_inference.py)
- [x] [fineweb_nemotron_edu_classifier_inference](https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/classifiers/fineweb_nemotron_edu_classifier_inference.py)
-...
## Description Without this, that single cluster will have a datatype of int32, versus float32 for the other columns, and hence all of `semdedup_pruning_tables` won't be read in case someone...
## Description This PR fixes https://github.com/NVIDIA/NeMo-Curator/pull/61/files by ensuring the CUDA context is always spread across multiple GPUs. #### Local test to verify this: ```python3 #!/usr/bin/env python3 """ Test to...
## Description ## Usage ```python # Add snippet demonstrating usage ``` ## Checklist - [ ] I am familiar with the [Contributing Guide](https://github.com/NVIDIA/NeMo-Curator/blob/main/CONTRIBUTING.md). - [ ] New or Existing tests...
**Describe the bug** While incorporating Ruff, we noticed a few code paths that are buggy. This master issue aims to capture the list of them, so that we can resolve...
## Description Tries to fix #533. LLM gives output topics with descriptions. When we try to create a YAML, those descriptions sometimes get treated as separate topics. We modify the...
I find that during tokenization, GPU utilization is always zero.