NeMo-Curator
Scalable data preprocessing and curation toolkit for LLMs
When running the [extract_dedup_data.py](https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/semdedup/extract_dedup_data.py#L61) script, the user may encounter a warning: ``` UserWarning: Insufficient elements for `head`. 10 elements requested, only 0 elements available. Try passing larger `npartitions` to `head`....
Hi! Would it be possible to hide this kind of error when running NeMo Curator on a CPU-only server? > /usr/local/lib/python3.10/dist-packages/cudf/utils/_ptxcompiler.py:61: UserWarning: Error getting driver and runtime versions: I...
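One possible workaround on the user side, until the library itself handles this, is to install a standard-library warning filter before the offending import. The sketch below is an assumption, not a NeMo Curator feature: the helper name `silence_driver_version_warning` is hypothetical, and the filter matches only the message prefix quoted in the report.

```python
import warnings


def silence_driver_version_warning() -> None:
    """Ignore only UserWarnings whose message starts with the quoted prefix.

    Hypothetical helper: the prefix is copied from the warning in the report.
    """
    warnings.filterwarnings(
        "ignore",
        message=r"Error getting driver and runtime versions",
        category=UserWarning,
    )


# Install the filter before importing any module that emits the warning
# at import time; filters added later do not affect warnings already shown.
silence_driver_version_warning()
```

Matching on the message rather than ignoring every `UserWarning` keeps unrelated warnings visible.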
Since `download_common_crawl` can be quite a large job (with ~100,000 WARC files per full snapshot), it is possible that a user's job may be interrupted, stopped, or cancelled for various...
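A minimal sketch of what resuming could look like, assuming each WARC URL maps to one output file: skip URLs whose file already exists, and write through a temporary `.part` file so an interrupted write is never mistaken for a finished one. The helpers `local_path_for`, `pending_urls`, and `save_atomically` are hypothetical names for illustration and are not part of `download_common_crawl`.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def local_path_for(url: str, output_dir: Path) -> Path:
    """Map a URL to its output file (assumes file names are unique per URL)."""
    return output_dir / Path(urlparse(url).path).name


def pending_urls(urls, output_dir: Path):
    """Return the URLs whose final output file does not exist yet."""
    return [u for u in urls if not local_path_for(u, output_dir).exists()]


def save_atomically(url: str, output_dir: Path, payload: bytes) -> Path:
    """Write to '<name>.part' first, then rename, so partial files are never final."""
    target = local_path_for(url, output_dir)
    tmp = target.with_name(target.name + ".part")
    tmp.write_bytes(payload)
    tmp.replace(target)
    return target


if __name__ == "__main__":
    # Placeholder URLs; a real run would fetch the payload over the network.
    urls = ["https://example.invalid/crawl/a.warc.gz",
            "https://example.invalid/crawl/b.warc.gz"]
    with tempfile.TemporaryDirectory() as d:
        out = Path(d)
        save_atomically(urls[0], out, b"stub bytes")
        print(pending_urls(urls, out))  # only the second URL is still pending
```

On restart, a job would call `pending_urls` first and process only what it returns.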
**Describe the bug** I used 5 A100 GPUs to run a fuzzy dedup task and encountered OOM issues. Here is the error info: ``` 2024-12-31 05:02:43,370 - distributed.worker - ERROR - Could not serialize object...
**Describe the bug** In the Data curation for DAPT tutorial (`tutorials/dapt-curation`) when attempting to decode files with an encoding that is not supported by the system (e.g., Vietnamese's VISCII in...
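One defensive pattern for this class of failure, sketched here as an assumption rather than as the tutorial's actual code, is to catch both the unknown-codec error and the decode error and fall back to a lossy UTF-8 decode. The function name `decode_with_fallback` is hypothetical.

```python
def decode_with_fallback(raw: bytes, encoding: str, fallback: str = "utf-8") -> str:
    """Decode bytes with the detected encoding, falling back if that fails.

    LookupError covers codecs the Python runtime does not ship;
    UnicodeDecodeError covers bytes that are invalid for the codec.
    """
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Lossy fallback: undecodable bytes become U+FFFD instead of raising.
        return raw.decode(fallback, errors="replace")
```

The fallback trades fidelity for robustness: text in an unsupported legacy encoding may come out garbled, but the pipeline keeps running and the affected documents can be filtered later.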
Hello all, I have a quick question. I just want to make sure that my workflow is correct and my path to installation is correct. I am wanting to the...
**Describe the bug** I encountered the following bug when using our own Parquet dataset with nemo_curator.utils.distributed_utils.read_data and nemo_curator.AddId operations, following the approach outlined in this [tutorial](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/pretraining-data-curation/red-pajama-v2-curation-tutorial.ipynb). However, when I manually...
Python script to reproduce: ``` from functools import partial from nemo_curator import FuzzyDuplicates, FuzzyDuplicatesConfig, Sequential, get_client from nemo_curator.datasets import DocumentDataset def fuzzy_dedupe(dataset, cache_dir, id_field, text_field): # dataset.df.reset_index(drop=True) fuzzy_dedup_config = FuzzyDuplicatesConfig(...
We should add more specific instructions (i.e., docker commands) for how a user can use NeMo Curator via the NeMo Framework Container. It would also be helpful to include instructions...