Praateek Mahajan issues

Results: 9 issues of Praateek Mahajan

Issue with io/ffmpeg.py

```
RuntimeError: Traceback (most recent call last):
  File "/home/praateek/.local/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 40, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "", line 32, in __getitem__
    vid = skvideo.io.vread(self.__xs[index])
  File...
```

[FR] Support SageMaker multi-model endpoints

## Describe the proposal

MLflow currently seems to deploy each model to a new SageMaker instance. Since November 2019, SageMaker has offered multi-model endpoints, which allow...
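To make the request concrete, here is a minimal sketch of how a caller addresses one model on a SageMaker multi-model endpoint. The `TargetModel` parameter of the runtime `InvokeEndpoint` call selects the model artifact; the endpoint name, artifact name, and payload below are hypothetical, and `build_invoke_request` is an illustrative helper, not part of MLflow or boto3.

```python
import json


def build_invoke_request(endpoint_name, target_model, payload):
    """Build keyword arguments for the SageMaker runtime InvokeEndpoint call.

    On a multi-model endpoint, TargetModel names the model artifact
    (relative to the endpoint's S3 prefix) that should serve the request.
    """
    return {
        "EndpointName": endpoint_name,
        "TargetModel": target_model,
        "ContentType": "application/json",
        "Body": json.dumps(payload),
    }


# Hypothetical names, for illustration only.
request = build_invoke_request(
    "shared-endpoint", "model-a.tar.gz", {"inputs": [1, 2, 3]}
)

# With AWS credentials configured, the request could be sent like this:
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   response = runtime.invoke_endpoint(**request)
```

Keeping the request construction separate from the network call makes it possible to check the routing logic without AWS access.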

enhancement

Acknowledged

integrations/sagemaker

[DRAFT] Trying dask_cudf's read_json / read_parquet

## Description

Reading 6000 files of ~25 MB each, i.e. ~145 GB, over 8 GPUs

| add_filename | partition_size | input_meta | Using `dask.read_json` #285 | Providing meta in `dask.from_map` #291 |
|--------|--------|--------|--------|---------|
...

[DRAFT] Passing meta to map_partitions for read_data

## Description

## Usage

```python
# Add snippet demonstrating usage
```

## Checklist

- [ ] I am familiar with the [Contributing Guide](https://github.com/NVIDIA/NeMo-Curator/blob/main/CONTRIBUTING.md).
- [ ] New or Existing tests...

Curator should support numpy > 2

**Is your feature request related to a problem? Please describe.**

Currently, [numpy is restricted to < 2](https://github.com/NVIDIA/NeMo-Curator/blob/fa4befcad0a804d9b8ad4a9870b2fd87196d2d26/requirements/requirements.txt#L17). The cudf 24.10 release, by contrast, [allows numpy 2.0](https://github.com/rapidsai/cudf/blob/branch-24.10/python/cudf/pyproject.toml#L28). However we tried just...
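As a rough illustration of what a `< 2` upper bound excludes, the sketch below compares release numbers as integer tuples. It is a simplification that assumes plain dotted versions, and the helper names are hypothetical; a real resolver follows the full PEP 440 rules.

```python
def parse_release(version):
    """Return the leading numeric release segment as a tuple of ints.

    Anything after the first non-digit character in a component
    (for example the "rc1" in "2.0.0rc1") is ignored.
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def below_upper_bound(version, upper):
    """True if `version` sorts strictly below the `upper` bound."""
    return parse_release(version) < parse_release(upper)


print(below_upper_bound("1.26.4", "2"))  # numpy 1.x passes the pin
print(below_upper_bound("2.0.0", "2"))   # numpy 2.0 is excluded
```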

enhancement

Semantic Dedup doesn't work with UCX

**Describe the bug**

Semantic Dedup often hangs when we call `semantic_cluster_dedup.extract_dedup_data`.

**Steps/Code to reproduce bug**

Run semantic dedup with `client = get_client(device_type='gpu', protocol='ucx')`.

**Environment overview**...

bug

[POC] GPT2Tokenizer using cudf

This pull request is a Proof of Concept for `GPT2Tokenizer` in the file `python/cudf/cudf/core/gpt2_tokenizer.py`. The `GPT2Tokenizer` class is designed to tokenize a cuDF strings column using CUDA GPT2 subword tokenizer...
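For readers unfamiliar with subword tokenization, the following is a toy, CPU-only sketch of the byte-pair-encoding merge loop that GPT-2 style tokenizers are built on. It is not the code from this pull request and does not use cuDF; the merge table is made up for illustration, whereas a real tokenizer loads its merges and vocabulary from files and works on bytes.

```python
def bpe_merge(word, merge_ranks):
    """Greedily apply BPE merges to a word.

    `merge_ranks` maps a pair of adjacent symbols to its priority
    (lower rank merges first). Returns the final list of symbols.
    """
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the best (lowest) rank.
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no known merge applies any more
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table, for illustration only.
toy_merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe_merge("lower", toy_merges))  # ['low', 'er']
```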

libcudf

Python

CMake

Java

conda

feat: Add NVIDIA as a Provider and expose it to Goose

Creates a provider in `exchange`. In goose, adds a config for the NVIDIA provider (using `llama-3.1-405b`) to `default_model_configuration`

enhancement

help wanted

work-in-progress

feat: Add devcontainer

Makes development easier in VS Code. See https://code.visualstudio.com/docs/devcontainers/containers. This lets folks contribute more easily: as long as they have the `Dev Containers` extension, they should be able to open the...