NeMo-Curator
Scalable data preprocessing and curation toolkit for LLMs
## Description CC: @ayushdg ## Checklist - [x] I am familiar with the [Contributing Guide](https://github.com/NVIDIA/NeMo-Curator/blob/main/CONTRIBUTING.md). - [x] New or Existing tests cover these changes. - [x] The documentation is up...
**Describe the bug** The warning ``` 2024-10-11 00:04:31,529 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be...
## Description This PR ensures that users can run the PEFT SDG tutorial using arbitrary API endpoints by exposing the URL that is used for synthetic data generation. ## Checklist...
We attempted to run tutorials/peft-curation-with-sdg and are facing runtime errors. Details are given below, along with the environment setup we tried. ``` python ./main.py \ --api-key \ --device gpu \...
**Describe the bug** Currently, our semdedup restart mechanism for embeddings is not working cleanly. This is because of the following (`add_filename=False`): https://github.com/NVIDIA/NeMo-Curator/blob/3a31ab13137e43fd8c1ffd40e07c52606a852acb/nemo_curator/scripts/semdedup/compute_embeddings.py#L62-L64 And write to filename is False: https://github.com/NVIDIA/NeMo-Curator/blob/3a31ab13137e43fd8c1ffd40e07c52606a852acb/nemo_curator/scripts/semdedup/compute_embeddings.py#L78 And...
**Is your feature request related to a problem? Please describe.** I’m working through [these Classifier and Heuristic Quality Filtering docs](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/qualityfiltering.html#data-curator-qualityfiltering). I’m looking for an elegant way to write filtered docs...
In some scenarios, a corpus file may contain columns that are not needed during the data curation step. We might reduce memory footprint by allowing the user to specify which...
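The idea in the request above can be illustrated with a minimal, standard-library sketch. It is not NeMo-Curator's API: the helper `read_jsonl_columns`, its `columns` parameter, and the sample field names (`id`, `text`, `raw_html`) are hypothetical and chosen only for illustration. It assumes a JSONL corpus and keeps only the requested fields of each record.

```python
import io
import json
from typing import Iterable, Iterator


def read_jsonl_columns(lines: Iterable[str], columns: list[str]) -> Iterator[dict]:
    """Yield one dict per JSONL record, keeping only the requested columns.

    Hypothetical helper for illustration; not part of NeMo-Curator.
    """
    wanted = set(columns)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Drop every field the caller did not ask for.
        yield {key: value for key, value in record.items() if key in wanted}


# In-memory stand-in for a corpus file with an unneeded "raw_html" column.
sample = io.StringIO(
    '{"id": 1, "text": "hello", "raw_html": "<p>hello</p>"}\n'
    '{"id": 2, "text": "world", "raw_html": "<p>world</p>"}\n'
)
rows = list(read_jsonl_columns(sample, ["id", "text"]))
```

With JSONL, each line is still parsed in full, so the saving comes from not retaining the unused fields downstream. Columnar formats such as Parquet can skip unneeded columns at read time, for example through the `columns` argument of `pandas.read_parquet`.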
**Is your feature request related to a problem? Please describe.** We need to assert that there is more than one worker. We need to add a check after this line: https://github.com/NVIDIA/NeMo-Curator/blob/9a424c7a498519aeb971a3453ae71447b952b500/nemo_curator/utils/distributed_utils.py#L156
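One possible shape for such a check is sketched below. The function name `check_min_workers` and its parameters are hypothetical, not taken from `distributed_utils.py`. The sketch assumes the caller can supply a mapping of worker addresses to worker info, such as the `"workers"` entry of a Dask client's `scheduler_info()` result.

```python
from typing import Mapping


def check_min_workers(workers: Mapping[str, object], minimum: int = 2) -> int:
    """Return the worker count, or raise if fewer than `minimum` are present.

    Hypothetical helper for illustration; `workers` is assumed to map a
    worker address to its info, as in client.scheduler_info()["workers"].
    """
    count = len(workers)
    if count < minimum:
        raise RuntimeError(
            f"Expected at least {minimum} workers, but found {count}."
        )
    return count


# Stand-in for the scheduler's worker mapping (addresses are made up).
fake_workers = {"tcp://127.0.0.1:40001": {}, "tcp://127.0.0.1:40002": {}}
worker_count = check_min_workers(fake_workers)
```

Raising an explicit exception rather than using a bare `assert` keeps the check active when Python runs with optimizations enabled (`-O`), which strips `assert` statements.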
**Is your feature request related to a problem? Please describe.** Based on user feedback, we need to fix the following to improve the user experience: - [x] [Enable PyTorch to...
## Description ## Usage ```python # Add snippet demonstrating usage ``` ## Checklist - [ ] I am familiar with the [Contributing Guide](https://github.com/NVIDIA/NeMo-Curator/blob/main/CONTRIBUTING.md). - [ ] New or Existing tests...