Curator

Scalable data preprocessing and curation toolkit for LLMs

Results 239 Curator issues

## Description CC: @ayushdg ## Checklist - [x] I am familiar with the [Contributing Guide](https://github.com/NVIDIA/NeMo-Curator/blob/main/CONTRIBUTING.md). - [x] New or Existing tests cover these changes. - [x] The documentation is up...

**Describe the bug** The warning ``` 2024-10-11 00:04:31,529 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be...

bug

## Description This PR ensures that users can run the PEFT SDG tutorial using arbitrary API endpoints by exposing the URL that is used for synthetic data generation. ## Checklist...

We attempted to run tutorials/peft-curation-with-sdg and are facing runtime errors. Details are given below, along with the environment setup we tried. ``` python ./main.py \ --api-key \ --device gpu \...

bug

**Describe the bug** Currently our semdedup restart mechanism for embedding is not working cleanly. This is because of the following (`add_filename=False`) https://github.com/NVIDIA/NeMo-Curator/blob/3a31ab13137e43fd8c1ffd40e07c52606a852acb/nemo_curator/scripts/semdedup/compute_embeddings.py#L62-L64 And write to filename is False https://github.com/NVIDIA/NeMo-Curator/blob/3a31ab13137e43fd8c1ffd40e07c52606a852acb/nemo_curator/scripts/semdedup/compute_embeddings.py#L78 And...

bug

**Is your feature request related to a problem? Please describe.** I’m working through [these Classifier and Heuristic Quality Filtering docs](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/qualityfiltering.html#data-curator-qualityfiltering). I’m looking for an elegant way to write filtered docs...

enhancement

In some scenarios, a corpus file may contain columns that are not needed during the data curation step. We might reduce memory footprint by allowing the user to specify which...

enhancement
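The column-selection idea in the request above can be illustrated with a minimal, standard-library sketch. It is not NeMo Curator's API: the helper name `read_jsonl_columns` and the sample field names are hypothetical, and the corpus is assumed to be in JSON Lines format. Each record is still parsed in full, so any saving comes only from what is retained afterwards.

```python
import io
import json
from typing import Iterable, Iterator, TextIO


def read_jsonl_columns(lines: TextIO, columns: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per JSONL record, keeping only the requested columns.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    wanted = set(columns)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Drop every field the caller did not ask for.
        yield {key: value for key, value in record.items() if key in wanted}


# Assumed sample corpus with an unneeded "raw_html" column.
sample = io.StringIO(
    '{"id": 1, "text": "hello", "raw_html": "<p>hello</p>"}\n'
    '{"id": 2, "text": "world", "raw_html": "<p>world</p>"}\n'
)
rows = list(read_jsonl_columns(sample, columns=["id", "text"]))
print(rows)
```

Because the helper is a generator, only one parsed record is held at a time before the unwanted fields are discarded.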

**Is your feature request related to a problem? Please describe.** We need to assert that there is more than one worker. We need to add a check after this: https://github.com/NVIDIA/NeMo-Curator/blob/9a424c7a498519aeb971a3453ae71447b952b500/nemo_curator/utils/distributed_utils.py#L156

enhancement
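One possible shape for the worker-count check requested above is sketched here. It assumes the input is a mapping shaped like the one returned by `dask.distributed`'s `Client.scheduler_info()`, which lists workers under a `"workers"` key. The function name `check_multiple_workers` is hypothetical, and this is not the check that was eventually added to the repository, if any.

```python
from typing import Any, Mapping


def check_multiple_workers(scheduler_info: Mapping[str, Any]) -> int:
    """Return the worker count, raising if there is not more than one worker.

    Hypothetical helper; expects a mapping shaped like Client.scheduler_info().
    """
    workers = scheduler_info.get("workers", {})
    n_workers = len(workers)
    if n_workers <= 1:
        raise RuntimeError(
            f"Expected more than one worker, found {n_workers}."
        )
    return n_workers


# Assumed example input: two workers keyed by address.
info = {"workers": {"tcp://127.0.0.1:1001": {}, "tcp://127.0.0.1:1002": {}}}
n_workers = check_multiple_workers(info)
print(n_workers)
```

With a live cluster, the argument would be `client.scheduler_info()`, taken after the client has connected and the workers have registered.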

**Is your feature request related to a problem? Please describe.** Based on user feedback, we need to fix the following to improve the user experience: - [x] [Enable PyTorch to...

enhancement

## Description ## Usage ```python # Add snippet demonstrating usage ``` ## Checklist - [ ] I am familiar with the [Contributing Guide](https://github.com/NVIDIA/NeMo-Curator/blob/main/CONTRIBUTING.md). - [ ] New or Existing tests...