NeMo-Curator issues

Multilingual Data Curation tutorial

3

Update deduplication scripts

When running the [extract_dedup_data.py](https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/semdedup/extract_dedup_data.py#L61) script, the user may encounter a warning:
```
UserWarning: Insufficient elements for `head`. 10 elements requested, only 0 elements available. Try passing larger `npartitions` to `head`....
```

sarahyurick

documentation

good first issue

[FEA] Remove GPU-related messages on CPU-only servers

1

Hi! Would it be possible to hide this kind of error when running NeMo Curator on a CPU-only server? > /usr/local/lib/python3.10/dist-packages/cudf/utils/_ptxcompiler.py:61: UserWarning: Error getting driver and runtime versions: I...

miguelusque

enhancement

jira
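Until the request above is addressed in the library, a user-side workaround might look like the following. This is a minimal sketch using only Python's standard `warnings` module; the function name `silence_cudf_user_warnings` is hypothetical, and it assumes the message is attributed to a module under the `cudf` package, as the path in the report suggests.

```python
import warnings


def silence_cudf_user_warnings():
    """Ignore UserWarning instances attributed to the cudf package."""
    # The regex is matched against the start of the warning's module name,
    # so it covers "cudf" and any "cudf.*" submodule, but not e.g. "cudfoo".
    warnings.filterwarnings(
        "ignore",
        category=UserWarning,
        module=r"cudf(\.|$)",
    )


# Install the filter before any import that may pull in cudf
# (assumption: the warning is emitted while cudf is being imported).
silence_cudf_user_warnings()
```

The filter only hides the message; it does not change whether GPU code paths are attempted.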

Add example of how to resume an interrupted `download_common_crawl` job

Since `download_common_crawl` can be quite a large job (with ~100,000 WARC files per full snapshot), it is possible that a user's job may be interrupted, stopped, or cancelled for various...

sarahyurick

documentation
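One generic way to resume such a job is to skip files that are already present on disk. The sketch below is not NeMo Curator's API: `pending_downloads` is a hypothetical helper, and it assumes each local file is named after the last path segment of its URL.

```python
from pathlib import Path


def pending_downloads(urls, output_dir):
    """Return the URLs whose target file is missing or empty in output_dir."""
    output_dir = Path(output_dir)
    pending = []
    for url in urls:
        # Assumed naming scheme: local file name == last URL path segment.
        target = output_dir / url.rstrip("/").rsplit("/", 1)[-1]
        if not target.exists() or target.stat().st_size == 0:
            pending.append(url)
    return pending
```

A file that was only partially written before the interruption is non-empty and would be skipped by this check, so a real job would probably also need to write to a temporary name and rename on completion.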

fuzzy_dedup OOM issue

4

**Describe the bug** Using 5 A100 GPUs for a fuzzy_dedup task, I encountered OOM issues. Here is the error info:
```
2024-12-31 05:02:43,370 - distributed.worker - ERROR - Could not serialize object...
```

chenrui17

bug

jira

`LookupError` not caught during Encoding handling

4

**Describe the bug** In the Data curation for DAPT tutorial (`tutorials/dapt-curation`), when attempting to decode files with an encoding that is not supported by the system (e.g., Vietnamese's VISCII in...

ggcr

bug

jira
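The failure mode in the report above can be illustrated in plain Python: `bytes.decode` raises `LookupError` when the codec name is unknown to the interpreter, which is separate from the `UnicodeDecodeError` raised for invalid bytes. The sketch below is not the tutorial's code; `decode_with_fallback` is a hypothetical helper, and falling back to UTF-8 with replacement characters is only one possible policy.

```python
def decode_with_fallback(raw, encoding, fallback="utf-8"):
    """Decode raw bytes, falling back if the codec is unknown or the bytes are invalid."""
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # LookupError: the codec name is not registered in this Python build.
        # UnicodeDecodeError: the codec exists but cannot decode these bytes.
        return raw.decode(fallback, errors="replace")
```

Catching both exception types keeps a single unsupported file from aborting the whole run, at the cost of possibly garbled text for that file.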

Running Curator under SLURM Cluster

6

Hello all, I have a quick question. I just want to make sure that my workflow is correct and my path to installation is correct. I am wanting to the...

philm001

nemo_curator.utils.distributed_utils.read_data doesn't work for my own parquet dataset unless cleaning text by myself

1

**Describe the bug** I encountered the following bug when using our own Parquet dataset with nemo_curator.utils.distributed_utils.read_data and nemo_curator.AddId operations, following the approach outlined in this [tutorial](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/pretraining-data-curation/red-pajama-v2-curation-tutorial.ipynb). However, when I manually...

RickyShi46

bug

jira

Consecutive execution of fuzzy deduplication on different columns fails with errors

4

Python script to reproduce:
```
from functools import partial
from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig, Sequential, get_client
from nemo_curator.datasets import DocumentDataset

def fuzzy_dedupe(dataset, cache_dir, id_field, text_field):
    # dataset.df.reset_index(drop=True)
    fuzzy_dedup_config = FuzzyDuplicatesConfig(...
```

sarahyurick

bug

jira

Add more Docker instructions to README

We should add more specific instructions (i.e., docker commands) for how a user can use NeMo Curator via the NeMo Framework Container. It would also be helpful to include instructions...

sarahyurick

documentation
