NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs

Results 142 NeMo-Curator issues

Initial discussion happened with @VibhuJawa. **Is your feature request related to a problem? Please describe.** While running a workflow on Slurm with large files, if it needs to be cancelled...

enhancement
jira

## Description This PR adds an example of translation to Hindi using a ct2 model. Issue [here](https://github.com/NVIDIA/NeMo-Curator/issues/246). This example depends on CrossFit's [PR](https://github.com/rapidsai/crossfit/pull/83). ## Checklist - [ ] I am...

documentation

There are a couple of GitHub Actions I want to add to NeMo Curator: - [x] GPU CI (PR: https://github.com/NVIDIA/NeMo-Curator/pull/253) - [x] Always have it run without having to re-add...

We should see if we can add DCO signing as a pre-commit check. A lot of new developers get stumped by this, so we should make sure that this process...

enhancement
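For the DCO sign-off check proposed above, a commit-message hook could look roughly like the sketch below. It assumes the hook is given the path to the commit message file as its only argument (the convention for git `commit-msg` hooks). The names `has_signoff`, `main`, and `check_dco.py` are hypothetical and not part of NeMo-Curator.

```python
import re
import sys

# Matches the trailer that `git commit -s` appends, for example
# "Signed-off-by: Jane Doe <jane@example.com>".
SIGNOFF_RE = re.compile(
    r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>\s*$", re.MULTILINE
)


def has_signoff(message: str) -> bool:
    """Return True if the commit message contains a DCO sign-off trailer."""
    # Drop the comment lines that git places in the message template.
    body = "\n".join(
        line for line in message.splitlines() if not line.startswith("#")
    )
    return SIGNOFF_RE.search(body) is not None


def main(argv: list[str]) -> int:
    """Hypothetical hook entry point: argv[1] is the commit message file."""
    if len(argv) != 2:
        print("usage: check_dco.py <commit-msg-file>", file=sys.stderr)
        return 2
    with open(argv[1], encoding="utf-8") as fh:
        message = fh.read()
    if has_signoff(message):
        return 0
    print(
        "Missing 'Signed-off-by:' trailer; commit with `git commit -s`.",
        file=sys.stderr,
    )
    return 1
```

A hook script would end with `sys.exit(main(sys.argv))`, so a non-zero return value aborts the commit and prints the hint about `git commit -s`.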

## Description Reading 6000 files of ~25 MB each, i.e. ~145 GB, over 8 GPUs | add_filename | partition_size | input_meta | Using `dask.read_json` #285 | Providing meta in `dask.from_map` #291 | |--------|--------|--------|--------|---------|...

## Description ## Usage ```python # Add snippet demonstrating usage ``` ## Checklist - [ ] I am familiar with the [Contributing Guide](https://github.com/NVIDIA/NeMo-Curator/blob/main/CONTRIBUTING.md). - [ ] New or Existing tests...

We should retire text_bytes_aware_shuffle now that https://github.com/NVIDIA/NeMo-Curator/pull/77 is merged. That means refactoring the code below: https://github.com/NVIDIA/NeMo-Curator/blob/c2f296cb752e06c1b8f6d9bd28e618105320bce5/nemo_curator/modules/fuzzy_dedup.py#L1119-L1144

**Describe the bug** We have had multiple breakages where the CUDA context is only used for GPU 0 in a Dask + PyTorch environment. Sometimes this can occur due to a library creating...

enhancement
jira

**Is your feature request related to a problem? Please describe.** Currently [numpy is restricted to < 2](https://github.com/NVIDIA/NeMo-Curator/blob/fa4befcad0a804d9b8ad4a9870b2fd87196d2d26/requirements/requirements.txt#L17), but the cudf 24.10 release [allows numpy 2.0](https://github.com/rapidsai/cudf/blob/branch-24.10/python/cudf/pyproject.toml#L28). However we tried just...

enhancement
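The numpy pin discussed above amounts to a check on the major version. A minimal sketch of that check, using only the standard library; the helper names `major_version` and `satisfies_pin` are hypothetical and do not come from NeMo-Curator or cudf:

```python
def major_version(version: str) -> int:
    """Return the major component of a version string such as '1.26.4'."""
    return int(version.split(".", 1)[0])


def satisfies_pin(version: str, max_major_exclusive: int = 2) -> bool:
    """Return True if the version falls under a '< 2'-style upper bound."""
    return major_version(version) < max_major_exclusive
```

With these helpers, `satisfies_pin("1.26.4")` is true and `satisfies_pin("2.0.0")` is false. Lifting the pin corresponds to raising `max_major_exclusive`.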

## Description This PR enables ctranslate2 model translation. It will work once CrossFit support for ctranslate2 models is added ([PR](https://github.com/rapidsai/crossfit/pull/83)). ## Usage ``` python3 NeMo-Curator/examples/ct2_trasnlation_example.py --input-data-dir --output-data-dir --ct2-model-path --files-per-partition 1 --input-text-field...