NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs

Results 142 NeMo-Curator issues

**Is your feature request related to a problem? Please describe.** The current deduplication examples suggest `compute` on the list of duplicate documents produced via exact/fuzzy deduplication and use the computed...

enhancement
jira

As I've worked on several NeMo Curator functionalities, I've been a bit annoyed that our parameter names aren't consistent across different modules. For example, `text_field`, `input_text_field`, `text_column`, `text_column_name`, `input_json_text_field`, `dataset_text_field`,...

jira

**Describe the bug** In the Zyda2 tutorial, several scripts, such as [process_dclm.py](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/zyda2-tutorial/0_processing/process_dclm.py), attempt to start a Dask LocalCluster. These scripts read an environment variable, `CPU_WORKERS = os.environ.get("CPU_WORKERS")`, to set up the...

bug
jira
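
The `os.environ.get("CPU_WORKERS")` lookup quoted in the issue above returns a string, or `None` when the variable is unset. The snippet is truncated, so the following is only a minimal sketch of how such a value could be validated before it is used as a worker count; it is an assumption about the kind of failure involved, not the fix adopted in the repository. The helper name `resolve_cpu_workers` and the fallback to `os.cpu_count()` are hypothetical.

```python
import os


def resolve_cpu_workers(default=None):
    """Read CPU_WORKERS from the environment and return it as a positive int.

    Hypothetical helper: falls back to `default`, then to os.cpu_count(),
    when the variable is unset or empty.
    """
    raw = os.environ.get("CPU_WORKERS")  # str, or None if unset
    if raw is None or raw.strip() == "":
        if default is not None:
            return default
        return os.cpu_count() or 1
    try:
        workers = int(raw)
    except ValueError:
        raise ValueError(f"CPU_WORKERS must be an integer, got {raw!r}") from None
    if workers < 1:
        raise ValueError(f"CPU_WORKERS must be at least 1, got {workers}")
    return workers
```

The returned integer could then be passed wherever the script expects a worker count, so that an unset or malformed variable is reported clearly instead of surfacing later as a type error.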

**Describe the bug** When running the [2_compute_counts.py](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/zyda2-tutorial/2_dupes_removal/2_compute_counts.py) script, it fails with an error `Exception: 'KeyError("[\'size\'] not in index")'` **Steps/Code to reproduce bug** 1. Follow steps in [tutorial](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/zyda2-tutorial) 2. Run `python3...

bug
jira

**Describe the bug** On smaller GPU SKUs we are running into memory issues in the broadcast merge in Connected Components. We have to decrease that memory footprint without hurting performance...

enhancement
jira

See https://github.com/NVIDIA/NeMo-Curator/pull/372#discussion_r1844590417 for context.

enhancement
jira

**Is your feature request related to a problem? Please describe.** Under the hood, PII Modifier uses Presidio (which uses spaCy, I believe). Currently if the documents are very long (I...

enhancement
jira

**Is your feature request related to a problem? Please describe.** (not urgent since we anyway have to spill to host memory, but we might benefit from faster I/O and dataset...

enhancement
jira

From my experience trying to run PII Modifier, if you have a fresh Docker container and you run `deidentify --device gpu ...` the job might fail due at the...

bug
jira

**Is your feature request related to a problem? Please describe.** cuDF 25.02 will deprecate the old `minhash` and rename `minhash_permuted` to `minhash` (See: https://github.com/rapidsai/cudf/pull/17421). Curator should update the MinHash codebase...

enhancement
jira