NeMo-Curator
Scalable data preprocessing and curation toolkit for LLMs
**Is your feature request related to a problem? Please describe.** In the minhash script, we implicitly convert the string ID into two integer IDs (doc_id + dataset_id). This is different from...
When reading a dataset with `DocumentDataset.read_parquet(..., blocksize=???, files_per_partition=None)` and running fuzzy dedup with `protocol=ucx` and `false positive=on`, we run into an error during the `shuffle_docs_on_buckets` -> `_batched_merge_and_write` step ```python Stage3 (False Postive Check):...
**Is your feature request related to a problem? Please describe.** If nightly scheduled tests fail, we would like to be notified on Slack. **Describe the solution you'd like** Code...
We should look into enabling a best-fit packing feature for dataset curation. This was used by DeepSeek, and it seems like we can use our existing bin packing features to enable it...
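For context, the generic best-fit decreasing heuristic can be sketched as below. This is only an illustration of the textbook algorithm, not NeMo Curator's existing bin packing code; the function name `best_fit_decreasing` and its parameters are hypothetical, and it assumes every document already fits within the sequence capacity.

```python
from typing import List


def best_fit_decreasing(lengths: List[int], capacity: int) -> List[List[int]]:
    """Pack items into bins of size `capacity`; returns item indices per bin.

    Hypothetical helper for illustration only (not a NeMo Curator API).
    """
    bins: List[List[int]] = []   # item indices assigned to each bin
    remaining: List[int] = []    # free space left in each bin

    # Visit items from longest to shortest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        size = lengths[idx]
        if size > capacity:
            raise ValueError(f"item {idx} (length {size}) exceeds capacity {capacity}")

        # Best fit: choose the bin with the least free space that still fits.
        best = -1
        for b, free in enumerate(remaining):
            if size <= free and (best == -1 or free < remaining[best]):
                best = b

        if best == -1:
            # No existing bin fits, so open a new one.
            bins.append([idx])
            remaining.append(capacity - size)
        else:
            bins[best].append(idx)
            remaining[best] -= size
    return bins


# Token counts for five documents, packed into sequences of at most 8 tokens.
packed = best_fit_decreasing([5, 3, 4, 2, 6], capacity=8)
print(packed)
```

In this sketch each inner list holds the indices of the documents that would share one training sequence. Documents longer than the capacity would need to be split beforehand, which is left out here.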
**Is your feature request related to a problem? Please describe.** We are adding `partition_on` (https://github.com/NVIDIA/NeMo-Curator/pull/519), which is very similar to `separate_by_metadata`. We should try to refactor `separate_by_metadata` and make...
**Is your feature request related to a problem? Please describe.** My team is currently working on removing PII from text data in Southeast Asian languages. When...
## Description Add a modifier that performs regex replacements. ## Usage ``` regex_params = [ {"pattern": "ö", "repl": "o"}, { "pattern": "[^ !$%',-.0123456789;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz/:]", "repl": "", }, ] modifier = RegexModifier(regex_params)...
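The behaviour described above can be approximated with the standard library alone. The sketch below assumes the replacements are applied in list order using `re.sub`; the helper name `apply_regex_replacements` is hypothetical and is not the `RegexModifier` implementation from this PR.

```python
import re
from typing import Dict, List


def apply_regex_replacements(text: str, regex_params: List[Dict[str, str]]) -> str:
    """Apply each {"pattern", "repl"} pair to `text`, in order.

    Hypothetical stand-in for illustration; not the PR's RegexModifier.
    """
    for params in regex_params:
        text = re.sub(params["pattern"], params["repl"], text)
    return text


regex_params = [
    # Map "ö" to a plain "o" before the allow-list below would drop it.
    {"pattern": "ö", "repl": "o"},
    # Remove every character that is not in the allowed set.
    {
        "pattern": "[^ !$%',-.0123456789;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz/:]",
        "repl": "",
    },
]

cleaned = apply_regex_replacements("Köln: 100% (ok)!", regex_params)
print(cleaned)  # Koln: 100% ok!
```

Order matters in this sketch: if the allow-list pattern ran first, "ö" would be deleted rather than transliterated.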
## Description Currently, the FastTextLangId filter only supports filtering by language ID, but sometimes we know what language the data is supposed to be, and it would be...
TODO:
- [x] Exact deduplication files
- [x] Semantic deduplication files
- [x] Fuzzy deduplication files
- [x] Tutorials folder
## Description Provides functionality to create training datasets for retriever customization ## Usage 1. Semantically cluster documents into partitions: ```python python3 repartition.py --input-dir= --hard-negative-mining-config= --output-dir= --api-key= ``` 2. Mine hard...