data issues

Use `DataPipe` not `IterDataPipe` as the type hints for `DataLoader2` related code

2

### 🐛 Describe the bug

There are a couple of places where `DataLoader2` uses `IterDataPipe` as the type hint ([here](https://github.com/pytorch/data/blob/12cfaf8899b1337981cd4edf9deef127f925f1bd/torchdata/dataloader2/dataloader2.py#L22), [here](https://github.com/pytorch/data/blob/12cfaf8899b1337981cd4edf9deef127f925f1bd/torchdata/dataloader2/reading_service.py#L17), etc.). We should add a type called `DataPipe =...

ejguan

Is our handling of open files safe?

2

Our current strategy is to wrap all file handles in a [`StreamWrapper`](https://github.com/pytorch/pytorch/blob/88fca3be5924dd089235c72e651f3709e18f76b8/torch/utils/data/datapipes/utils/common.py#L154). It dispatches all calls to the wrapped object and adds a `__del__` method:

```py
class StreamWrapper:
    def __init__(self, file_obj):...
```

pmeier
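The pattern described in this issue (forward every call to the wrapped object, close it in `__del__`) can be illustrated with a minimal sketch. `ClosingWrapper` is a hypothetical stand-in written for this example and is not the actual `StreamWrapper` implementation; it only shows the general technique.

```python
import io


class ClosingWrapper:
    """Hypothetical stand-in for a stream wrapper; not the real StreamWrapper."""

    def __init__(self, file_obj):
        self.file_obj = file_obj

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails, so any attribute the
        # wrapper does not define itself is forwarded to the wrapped object.
        if name == "file_obj":
            # Avoid infinite recursion if __init__ never set the attribute.
            raise AttributeError(name)
        return getattr(self.file_obj, name)

    def __del__(self):
        # Best-effort cleanup when the wrapper is garbage collected.
        try:
            self.file_obj.close()
        except Exception:
            pass


stream = io.BytesIO(b"payload")
wrapper = ClosingWrapper(stream)
first_read = wrapper.read()  # forwarded to BytesIO.read
del wrapper  # in CPython, dropping the last reference runs __del__ right away
closed_after_del = stream.closed
```

The sketch relies on CPython's reference counting to run `__del__` promptly. Other interpreters, reference cycles, and interpreter shutdown can delay or skip finalizers, which is the kind of behavior the question in the title is concerned with.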

pointer to a similar library / feedback

1

### 📚 The doc issue

Hi! I'm the author of a Python library called [SeqTools](https://github.com/nlgranger/SeqTools). It predates torchdata and provides essentially the same functionality as `MapDataPipes`. I just...

nlgranger

Add support for sharding filter in distributed settings

4

### 🚀 The feature

Implement a `distributed_sharding_filter` that would behave similarly to `sharding_filter` (https://github.com/pytorch/pytorch/blob/3f140c5b32fa8685cc7a10bdb94f3f8b127e3a92/torch/utils/data/datapipes/iter/grouping.py), but would filter according to global rank and world size if `torch.distributed` was initialized. If torch.distributed...

jkulhanek
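As a rough illustration of filtering by global rank and world size, here is a minimal sketch in plain Python. `shard_by_rank` is a hypothetical helper, not part of torchdata. In a real distributed job the two arguments would presumably come from `torch.distributed.get_rank()` and `torch.distributed.get_world_size()` once `torch.distributed.is_initialized()` returns true; they are passed in explicitly here so the example runs without a process group.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_by_rank(source: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Yield only the items whose position belongs to the given rank."""
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError(f"invalid rank {rank} for world size {world_size}")
    for index, item in enumerate(source):
        # Round-robin assignment: item i goes to rank i % world_size.
        if index % world_size == rank:
            yield item


# Simulate four ranks reading the same ten-element source.
shards = [list(shard_by_rank(range(10), rank, 4)) for rank in range(4)]
# shards == [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

Note that the shards have unequal lengths when the source size is not a multiple of the world size, which a real implementation would need to account for.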

[BE] Unify `buffer_size` across datapipes

8

The `buffer_size` parameter is currently fairly inconsistent across datapipes:

| name | default `buffer_size` | infinite `buffer_size` | warn on infinite |
|--------------------|-------------------------|--------------------------|--------------------|
| Demultiplexer | 1e3 | -1 |...

pmeier

Better Engineering

Router for same functional API

1

## 🚀 Feature

We could support different DataPipes that provide the same functionality under a single functional API by using a router datapipe. For `open`, for example, we can support:

- URL
- IoPath
-...

ejguan

research feature

fastai's DataBlock

3

This new data API looks great and has many similarities with the [DataBlock](https://docs.fast.ai/tutorial.datablock.html) API from fastai. We have a [Discord](https://discord.gg/Yy82YcR4) channel for fastai development and we would love to help/test/integrate...

tcapelle

Support offloading data pre-processing to auxiliary devices

2

### 🚀 The feature, motivation and pitch

Occasionally one might find that their GPU is idle due to a bottleneck on the input data pre-processing pipeline (which might include data...

czmrand

feature

module: dataloader

triaged

module: data

SQL Pipe

4

### 🚀 The feature

Allow datasets to be sourced from SQL queries. It should be possible to substitute different backends, e.g. Athena, Presto, Postgres. It should do smart batching to minimize the number of...

Nintorac
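One possible shape for sourcing data from a SQL query in batches is sketched below. It uses the standard-library `sqlite3` module as a stand-in backend and the DB-API `fetchmany` call. `iter_query_batches`, the `samples` table, and the batch size are all hypothetical choices for this example, and plain fixed-size fetching is not necessarily the "smart batching" the issue has in mind.

```python
import sqlite3
from typing import Iterator, List, Tuple


def iter_query_batches(connection, query: str, batch_size: int) -> Iterator[List[Tuple]]:
    """Run a query and yield its rows in lists of at most batch_size rows."""
    cursor = connection.cursor()
    cursor.execute(query)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


# Build a small in-memory table to query (hypothetical schema).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE samples (id INTEGER, label TEXT)")
connection.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [(i, f"label-{i}") for i in range(5)],
)

batches = list(
    iter_query_batches(connection, "SELECT id, label FROM samples ORDER BY id", 2)
)
batch_sizes = [len(batch) for batch in batches]
connection.close()
```

Because the helper only relies on `cursor()`, `execute`, and `fetchmany`, any driver that follows the Python DB-API could in principle be passed in place of the `sqlite3` connection.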

Reading services should check if spawn is `fork` and cuda context already initiated.

### 🚀 The feature

Instead of at runtime, we can detect CUDA context fork issues when the reading service is initialized and point users to the mistake in their code.

### Motivation, pitch

Lots...

VitalyFedyunin
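A minimal sketch of the kind of early check this issue proposes is shown below. `check_fork_safety` is a hypothetical helper, not an existing torchdata function. The two inputs are passed in explicitly so the example runs without a GPU or PyTorch; in real code they would presumably come from `multiprocessing.get_start_method()` and `torch.cuda.is_initialized()`.

```python
import multiprocessing
from typing import Optional


def check_fork_safety(start_method: Optional[str], cuda_initialized: bool) -> None:
    """Fail fast if workers would be forked after CUDA has been initialized."""
    if start_method == "fork" and cuda_initialized:
        raise RuntimeError(
            "CUDA was initialized before worker processes are created with the "
            "'fork' start method; initialize CUDA after creating the workers or "
            "use the 'spawn' start method instead."
        )


# With CUDA not initialized, the check passes for any start method.
current_method = multiprocessing.get_start_method(allow_none=True)
check_fork_safety(current_method, cuda_initialized=False)

# Simulate the problematic combination to show the early error.
try:
    check_fork_safety("fork", cuda_initialized=True)
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
```

Running such a check in the reading service's initialization path would surface the problem before any worker process is started.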
