NeMo-Curator issues

Add features from Llama Nemotron tutorial to NeMo Curator modules

1

While some of the classes in https://github.com/NVIDIA/NeMo-Curator/pull/695 are very specific to the dataset being curated, others could be useful for a variety of datasets and should be added as NeMo...

sarahyurick

enhancement

Multilingual PII support

Currently, a user who wants to implement non-English PII redaction must edit our source code. This is not ideal. The solution should be fairly simple; see: https://github.com/NVIDIA/NeMo-Curator/blob/c909eb3ec43aba0472e655cc8ab473a48543ea11/docs/user-guide/personalidentifiableinformationidentificationandremoval.rst#multilingual-pii-redaction

sarahyurick

enhancement

jira

Remove dask conditionals from our codebase

**Is your feature request related to a problem? Please describe.** We have a few variables in _compat.py (and possibly code paths that use these variables) that are for older versions of...

praateekmahajan

enhancement

jira

Add classifier CLI script tests

Loosely modeled after the NeMo setup:
- https://github.com/NVIDIA/NeMo/tree/main/tests/functional_tests
- https://github.com/NVIDIA/NeMo/blob/main/.github/workflows/cicd-main-e2e-tests.yml
TODO:
- [x] [aegis_classifier_inference](https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/classifiers/aegis_classifier_inference.py)
- [x] [content_type_classifier_inference](https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/classifiers/content_type_classifier_inference.py)
- [x] [domain_classifier_inference](https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/classifiers/domain_classifier_inference.py)
- [x] [fineweb_edu_classifier_inference](https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/classifiers/fineweb_edu_classifier_inference.py)
- [x] [fineweb_mixtral_edu_classifier_inference](https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/classifiers/fineweb_mixtral_edu_classifier_inference.py)
- [x] [fineweb_nemotron_edu_classifier_inference](https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/classifiers/fineweb_nemotron_edu_classifier_inference.py)
-...

sarahyurick

gpuci

SemDedup bug fix for single element cluster

## Description Without this fix, a single-element cluster will have a datatype of int32 versus float32 for other columns, so none of `semdedup_pruning_tables` can be read in case someone...

praateekmahajan

gpuci

Fail loudly for NeMo Curator Dask-Cuda cluster creation CUDA context issues

## Description This PR fixes https://github.com/NVIDIA/NeMo-Curator/pull/61/files by ensuring we always have a CUDA context spread across multiple GPUs. #### Local Test to verify this: ```python3 #!/usr/bin/env python3 """ Test to...

VibhuJawa

gpuci

[WIP] Remote I/O in SemDedup

## Description ## Usage ```python # Add snippet demonstrating usage ``` ## Checklist - [ ] I am familiar with the [Contributing Guide](https://github.com/NVIDIA/NeMo-Curator/blob/main/CONTRIBUTING.md). - [ ] New or Existing tests...

praateekmahajan

Ruff Bug fixes in code

**Describe the bug** While incorporating Ruff, we noticed a few code paths that are buggy. This master issue aims to capture the list of them, so that we can resolve...

praateekmahajan

bug

jira

Change prompt to try and get only topic names

1

## Description Tries to fix #533. The LLM outputs topics with descriptions. When we try to create a YAML file, those descriptions sometimes get treated as separate topics. We modify the...

abhinavg4

When I do fineweb-edu for classifier scoring, how do I overlap the tokenizer with the process of model infer?

3

I find that during tokenization, GPU utilization is always zero.

HuaYZhao

enhancement
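
The overlap asked about in the question above can be illustrated with a generic producer-consumer pattern: a background thread tokenizes upcoming batches while the main thread runs inference on batches that are already tokenized. This is a minimal sketch and not NeMo Curator's API. `tokenize`, `infer`, and `score_overlapped` are hypothetical stand-ins, and the `batch_size` and `prefetch` values are arbitrary assumptions.

```python
import queue
import threading


def tokenize(text):
    # Hypothetical stand-in for a CPU-bound tokenizer.
    return text.split()


def infer(batch):
    # Hypothetical stand-in for a GPU forward pass; returns one score per item.
    return [len(tokens) for tokens in batch]


def batched(items, size):
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]


def score_overlapped(texts, batch_size=2, prefetch=4):
    """Tokenize in a background thread while the main thread runs inference."""
    ready = queue.Queue(maxsize=prefetch)  # bounded, so tokenization cannot run far ahead
    sentinel = object()  # marks the end of the stream

    def producer():
        try:
            for chunk in batched(texts, batch_size):
                ready.put([tokenize(t) for t in chunk])
        finally:
            ready.put(sentinel)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    scores = []
    while True:
        batch = ready.get()
        if batch is sentinel:
            break
        scores.extend(infer(batch))
    worker.join()
    return scores


scores = score_overlapped(["a b c", "d e", "f", "g h i j"])
```

Whether a thread helps in practice depends on the tokenizer releasing the Python GIL while it works; if it does not, a process pool may be the better fit for the producer side.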
