NeMo-Curator
Scalable data pre-processing and curation toolkit for LLMs
**Describe the bug** Running the add ID module of Curator runs into OOMs even with a small batch size, e.g., 32. The dataset for adding IDs is a single snapshot of...
When training an LLM on code-related tasks, it has been shown empirically that pairing source code with its intermediate representation (IR) improves the LLM's capabilities (https://arxiv.org/abs/2403.03894). Adding this capability to...
Sometimes we need to detect the code license of the corpus being curated. Some code hosting platforms, such as [GitHub](https://www.github.com), provide an API to identify the repository license, but other...
Currently, when trying out [this notebook](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/distributed_data_classification/distributed_data_classification.ipynb) with a CPU Dask DataFrame, it fails with a `TypeError: batch_text_or_text_pairs has to be a list or a tuple (got )`. To reproduce, use...
As far as I am aware, this bug happens *only* when running interactively in a Jupyter Notebook connected to a Dask cluster; when running a regular Jupyter Notebook with a...
**Describe the bug** The single GPU tutorial [notebook](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb) fails to launch a GPU-based Dask cluster. **Steps/Code to reproduce bug** 1. Launch notebook 2. Run all steps in 0.Env Setup...
As I am revisiting the semantic deduplication documentation, there are a few things we should add: - Documentation of the CLI - If the user uses `add_id` like we recommend,...
Right now, there is some confusion around DataFrames being passed into `DocumentDataset`. For now, we expect them to be Dask or Dask-cuDF DataFrames, so we should add stronger type checking...
**Describe the bug** We should update our supported Python versions to include 3.11; we currently hard-pin Python to 3.10. I think this was done to maintain RAPIDS compatibility which is...
**Describe the bug** The single GPU tutorial [notebook](https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb) refers to a snapshot that is not available, and the snapshot is hardcoded in several cells of the notebook. **Steps/Code to reproduce bug**...