NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs

Results 142 NeMo-Curator issues

**Is your feature request related to a problem? Please describe.** In the minhash script, we implicitly convert a str id to 2 int ids (doc_id + dataset_id). This is different from...

enhancement
jira

When reading a dataset with `DocumentDataset.read_parquet(..., blocksize=???, files_per_partition=None)` and running fuzzy dedup with `protocol=ucx` and `false positive=on`, we run into an error during the `shuffle_docs_on_buckets` -> `_batched_merge_and_write` step ```python Stage3 (False Postive Check):...

bug
jira

**Is your feature request related to a problem? Please describe.** If nightly scheduled tests fail, we would like to be notified on Slack. **Describe the solution you'd like** Code...

enhancement
jira

We should look into enabling best-fit packing as a dataset curation feature. This was used by DeepSeek, and it seems like we can use our existing bin packing features to enable it...

enhancement
jira
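
To make the best-fit packing request above concrete, here is a minimal, generic sketch of best-fit decreasing bin packing over document lengths. It is an illustration only: the function name `best_fit_decreasing` and the `capacity` parameter are hypothetical, and it does not reflect NeMo-Curator's existing bin packing code or the exact procedure used elsewhere.

```python
def best_fit_decreasing(lengths, capacity):
    """Pack item lengths into bins of a fixed capacity.

    Hypothetical helper for illustration: items are placed largest first,
    each into the open bin that would be left with the least free space.
    """
    if any(n > capacity for n in lengths):
        raise ValueError("every item must fit into an empty bin")

    bins = []  # bins[i] is the list of lengths packed into bin i
    free = []  # free[i] is the remaining capacity of bin i

    for n in sorted(lengths, reverse=True):
        best = None
        for i, room in enumerate(free):
            # A bin qualifies if the item fits; prefer the tightest fit.
            if room >= n and (best is None or room < free[best]):
                best = i
        if best is None:
            # No open bin can hold the item, so open a new one.
            bins.append([n])
            free.append(capacity - n)
        else:
            bins[best].append(n)
            free[best] -= n
    return bins


if __name__ == "__main__":
    # Token counts of six documents packed into sequences of length 8.
    print(best_fit_decreasing([5, 4, 3, 3, 2, 1], capacity=8))
```

This linear scan is quadratic in the worst case; a sorted structure keyed by free space would be the natural choice for large datasets.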

**Is your feature request related to a problem? Please describe.** We are adding partition_on (https://github.com/NVIDIA/NeMo-Curator/pull/519) here, which is very similar to `separate_by_metadata`; we should try to refactor `separate_by_metadata` and make...

enhancement
jira

**Is your feature request related to a problem? Please describe.** My team is currently working on removing PII from text data in Southeast Asian languages. When...

enhancement
jira

## Description
Add a modifier that performs regex replacements.

## Usage
```
regex_params = [
    {"pattern": "ö", "repl": "o"},
    {
        "pattern": "[^ !$%',-.0123456789;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz/:]",
        "repl": "",
    },
]
modifier = RegexModifier(regex_params)
...
```
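
The preview above is truncated, so how `RegexModifier` applies its parameters is not shown. As a rough illustration of what a list of pattern/replacement pairs does, the following standalone sketch applies them in order with Python's `re.sub`. The helper `apply_regex_params` is hypothetical and is not part of NeMo-Curator.

```python
import re


def apply_regex_params(text, regex_params):
    """Apply each {"pattern", "repl"} rule to the text, in list order.

    Hypothetical helper for illustration; the real modifier may differ.
    """
    for rule in regex_params:
        text = re.sub(rule["pattern"], rule["repl"], text)
    return text


regex_params = [
    # Normalize "ö" to "o" first so it survives the allow-list below.
    {"pattern": "ö", "repl": "o"},
    # Drop every character outside the allowed set.
    {
        "pattern": "[^ !$%',-.0123456789;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz/:]",
        "repl": "",
    },
]

if __name__ == "__main__":
    print(apply_regex_params("Björn #1 (ok)!", regex_params))  # Bjorn 1 ok!
```

Rule order matters here: if the allow-list rule ran first, "ö" would be deleted instead of being mapped to "o".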

## Description Currently, the FastTextLangId filter only supports filtering by a language ID filter, but sometimes we know what language the data is supposed to be in, and it would be...

TODO:
- [x] Exact deduplication files
- [x] Semantic deduplication files
- [x] Fuzzy deduplication files
- [x] Tutorials folder

gpuci

## Description
Provides functionality to create training datasets for retriever customization.

## Usage
1. Semantically cluster documents into partitions:
```shell
python3 repartition.py --input-dir= --hard-negative-mining-config= --output-dir= --api-key=
```
2. Mine hard...