NeMo-Curator
Scalable data pre-processing and curation toolkit for LLMs
**Is your feature request related to a problem? Please describe.** The current deduplication examples suggest `compute` on the list of duplicate documents produced via exact/fuzzy deduplication and use the computed...
As I've worked on several NeMo Curator functionalities, I've been a bit annoyed that our parameter names aren't consistent across different modules. For example, `text_field`, `input_text_field`, `text_column`, `text_column_name`, `input_json_text_field`, `dataset_text_field`,...
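One way to migrate toward a single name without breaking callers is to accept the old spellings as deprecated aliases. The sketch below is only an illustration: it assumes `text_field` is the canonical name (the issue does not say which name should win), and `accept_text_field_aliases` and `count_characters` are hypothetical helpers, not part of NeMo Curator.

```python
import functools
import warnings

# Assumption: `text_field` is the canonical name; the aliases are taken
# from the names listed in the issue.
TEXT_FIELD_ALIASES = ("input_text_field", "text_column", "text_column_name")


def accept_text_field_aliases(func):
    """Map deprecated keyword aliases onto `text_field` before calling func."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for alias in TEXT_FIELD_ALIASES:
            if alias in kwargs:
                if "text_field" in kwargs:
                    raise TypeError(
                        f"got both 'text_field' and its alias {alias!r}"
                    )
                warnings.warn(
                    f"{alias!r} is deprecated; use 'text_field' instead",
                    DeprecationWarning,
                    stacklevel=2,
                )
                kwargs["text_field"] = kwargs.pop(alias)
        return func(*args, **kwargs)

    return wrapper


@accept_text_field_aliases
def count_characters(records, text_field="text"):
    """Hypothetical module function: length of the text in each record."""
    return [len(record[text_field]) for record in records]
```

With this in place, `count_characters(records, text_column="body")` still works but emits a `DeprecationWarning`, which gives callers time to switch to the shared name.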
**Describe the bug** In the Zyda2 tutorial, several scripts, such as [process_dclm.py](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/zyda2-tutorial/0_processing/process_dclm.py), attempt to start a Dask LocalCluster. These scripts read an environment variable, `CPU_WORKERS = os.environ.get("CPU_WORKERS")`, to set up the...
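For context, `os.environ.get` returns a string when the variable is set and `None` when it is not, so the value usually needs validation before it is used as a worker count. The sketch below shows one defensive way to read it. The helper name `resolve_cpu_workers` and the default of 1 are assumptions for illustration, not the tutorial's code and not necessarily the fix for this issue.

```python
import os


def resolve_cpu_workers(env=None, default=1):
    """Read CPU_WORKERS and return it as a positive integer.

    `env` defaults to os.environ; passing a plain dict makes this testable.
    `default` (an assumed value) is used when the variable is unset or empty.
    """
    env = os.environ if env is None else env
    raw = env.get("CPU_WORKERS")  # str if set, None otherwise
    if raw is None or raw.strip() == "":
        return default
    try:
        workers = int(raw)
    except ValueError:
        raise ValueError(
            f"CPU_WORKERS must be an integer, got {raw!r}"
        ) from None
    if workers < 1:
        raise ValueError(f"CPU_WORKERS must be >= 1, got {workers}")
    return workers
```

The returned integer could then be passed to whatever cluster setup the script uses, for example as the worker count of a Dask `LocalCluster`.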
**Describe the bug** When running the [2_compute_counts.py](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/zyda2-tutorial/2_dupes_removal/2_compute_counts.py) script, it fails with an error `Exception: 'KeyError("[\'size\'] not in index")'` **Steps/Code to reproduce bug** 1. Follow steps in [tutorial](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/zyda2-tutorial) 2. Run `python3...
**Describe the bug** On smaller GPU SKUs we are running into memory issues in the broadcast merge in Connected Components. We have to decrease that memory footprint without hurting performance...
See https://github.com/NVIDIA/NeMo-Curator/pull/372#discussion_r1844590417 for context.
**Is your feature request related to a problem? Please describe.** Under the hood, PII Modifier uses Presidio (which uses spaCy, I believe). Currently if the documents are very long (I...
**Is your feature request related to a problem? Please describe.** (not urgent, since we have to spill to host memory anyway, but we might benefit from faster I/O and dataset...
From my experience with trying to run PII Modifier, if you have a fresh docker container and you run `deidentify --device gpu ...` the job might fail due at the...
**Is your feature request related to a problem? Please describe.** cuDF 25.02 will deprecate the old `minhash` and rename `minhash_permuted` to `minhash` (See: https://github.com/rapidsai/cudf/pull/17421). Curator should update the MinHash codebase...