NeMo-Curator issues

Faster/More efficient duplicate removal for exact/fuzzy dedup.

1

**Is your feature request related to a problem? Please describe.** The current deduplication examples suggest calling `compute` on the list of duplicate documents produced via exact/fuzzy deduplication and use the computed...

ayushdg

enhancement

jira

Standardize `text_field`, `id_field`, etc. terminology

As I've worked on several NeMo Curator functionalities, I've been a bit annoyed that our parameter names aren't consistent across different modules. For example, `text_field`, `input_text_field`, `text_column`, `text_column_name`, `input_json_text_field`, `dataset_text_field`,...

sarahyurick

jira
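
One way the inconsistent parameter names listed above could be reconciled during a transition is an alias-normalizing helper. The following is only a sketch: `normalize_text_field` is a hypothetical helper, not part of NeMo Curator, and treating `text_field` as the canonical name is an assumption based on the issue title.

```python
import warnings

# Alias names taken from the issue text; the canonical choice is an assumption.
TEXT_FIELD_ALIASES = (
    "input_text_field",
    "text_column",
    "text_column_name",
    "input_json_text_field",
    "dataset_text_field",
)


def normalize_text_field(kwargs, canonical="text_field", aliases=TEXT_FIELD_ALIASES):
    """Return a copy of kwargs with any known alias renamed to the canonical key."""
    result = dict(kwargs)
    for alias in aliases:
        if alias not in result:
            continue
        value = result.pop(alias)
        # Refuse to guess when two spellings disagree.
        if canonical in result and result[canonical] != value:
            raise TypeError(
                f"Conflicting values for {canonical!r} and its alias {alias!r}"
            )
        warnings.warn(
            f"{alias!r} is deprecated; use {canonical!r} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        result[canonical] = value
    return result
```

A caller would pass its keyword arguments through the helper before using them, so old spellings keep working while emitting a deprecation warning.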

Zyda2 tutorial - TypeError when initializing Dask CPU cluster

1

**Describe the bug** In the Zyda2 tutorial, several scripts, such as [process_dclm.py](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/zyda2-tutorial/0_processing/process_dclm.py), attempt to start a Dask LocalCluster. These scripts read an environment variable with `CPU_WORKERS = os.environ.get("CPU_WORKERS")` to set up the...

ronjer30

bug

jira
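
The report above is truncated, so its root cause is not stated here. One plausible source of a `TypeError` is that `os.environ.get` returns a string (or `None` when the variable is unset), while a worker count is expected to be an integer. The sketch below shows a defensive way to read the variable; `resolve_worker_count` is a hypothetical helper, not code from the tutorial.

```python
import os


def resolve_worker_count(env=None, var="CPU_WORKERS"):
    """Read a worker count from an environment mapping and return it as an int.

    Falls back to the machine's CPU count when the variable is unset or empty.
    """
    env = os.environ if env is None else env
    raw = env.get(var)
    if raw is None or not raw.strip():
        # os.cpu_count() can return None, so guard with a minimum of 1.
        return os.cpu_count() or 1
    count = int(raw)  # raises ValueError for non-numeric input
    if count < 1:
        raise ValueError(f"{var} must be a positive integer, got {raw!r}")
    return count
```

The returned integer could then be passed wherever the script currently passes the raw environment value.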

Zyda2 tutorial - key error when running compute_counts script

1

**Describe the bug** When running the [2_compute_counts.py](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/zyda2-tutorial/2_dupes_removal/2_compute_counts.py) script, it fails with an error `Exception: 'KeyError("[\'size\'] not in index")'`

**Steps/Code to reproduce bug**
1. Follow steps in [tutorial](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/zyda2-tutorial)
2. Run `python3...

ronjer30

bug

jira

[IMP] Decrease Merge Peak Memory Usage of ConnectedComponents

**Describe the bug** On smaller GPU SKUs we are running into memory issues in the broadcast merge in Connected Components. We have to decrease that memory footprint without hurting performance...

VibhuJawa

enhancement

jira

Use CrossFit for `TokenizerFertilityFilter`

See https://github.com/NVIDIA/NeMo-Curator/pull/372#discussion_r1844590417 for context.

sarahyurick

enhancement

jira

PII Modifier should support documents greater than pre-configured length

**Is your feature request related to a problem? Please describe.** Under the hood, the PII Modifier uses Presidio (which uses spaCy, I believe). Currently if the documents are very long (I...

praateekmahajan

enhancement

jira
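
The request above is truncated, so neither the configured length limit nor the intended fix is known. One possible approach, sketched here under that caveat, is to split a long document into chunks below the limit, process each chunk, and rejoin the results. Both `split_text` and `modify_long_document` are hypothetical helpers, and `modify` stands in for whatever per-text callable performs the PII replacement.

```python
def split_text(text, max_chars):
    """Split text into chunks of at most max_chars, preferring whitespace cuts."""
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    chunks = []
    start = 0
    n = len(text)
    while start < n:
        end = min(start + max_chars, n)
        if end < n:
            # Look for the last space or newline inside the window.
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1  # keep the whitespace with the left-hand chunk
        chunks.append(text[start:end])
        start = end
    return chunks


def modify_long_document(text, modify, max_chars):
    """Apply modify to each chunk and concatenate the results in order."""
    return "".join(modify(chunk) for chunk in split_text(text, max_chars))
```

A known limitation of this design is that an entity straddling a chunk boundary may be missed, which is why the sketch prefers to cut at whitespace; overlapping windows would be a more robust but more involved alternative.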

PII Modifier should work with `DocumentDataset` on cudf

**Is your feature request related to a problem? Please describe.** (Not urgent, since we have to spill to host memory anyway, but we might benefit from faster I/O and dataset...

praateekmahajan

enhancement

jira

PII Modifier fails to load on worker sporadically raising `cannot reshape array of size`

1

In my experience running the PII Modifier, if you have a fresh Docker container and you run `deidentify --device gpu ...` the job might fail due at the...

praateekmahajan

bug

jira

Update minhash API after 25.02

**Is your feature request related to a problem? Please describe.** cuDF 25.02 will deprecate the old `minhash` and rename `minhash_permuted` to `minhash` (See: https://github.com/rapidsai/cudf/pull/17421). Curator should update the MinHash codebase...

ayushdg

enhancement

jira
