# data
A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
### 📚 The doc issue

This doesn't seem to be mentioned in the docs, but if you have two datapipes that use `sharding_round_robin_dispatcher` and then `mux` them together:

1. Any...
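For context, a minimal sketch of the combination this issue describes; the datapipe contents and worker count are made up for illustration, and the issue reports that the resulting behavior is undocumented:

```python
from torch.utils.data.datapipes.iter.sharding import SHARDING_PRIORITIES
from torchdata.datapipes.iter import IterableWrapper
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService

if __name__ == "__main__":
    # Two branches, each dispatched round-robin to workers, merged with mux.
    dp1 = IterableWrapper(range(0, 10)).sharding_round_robin_dispatch(
        SHARDING_PRIORITIES.MULTIPROCESSING
    )
    dp2 = IterableWrapper(range(10, 20)).sharding_round_robin_dispatch(
        SHARDING_PRIORITIES.MULTIPROCESSING
    )
    pipe = dp1.mux(dp2)
    dl = DataLoader2(pipe, reading_service=MultiProcessingReadingService(num_workers=2))
    for item in dl:
        print(item)
```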
### 🐛 Describe the bug

If either a worker process or the feeder process of the MPRS gets killed, the main process will just hang indefinitely and not throw an...
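A rough repro sketch, under the assumption that killing a worker mid-iteration triggers the reported hang (POSIX-only, since it uses `SIGKILL`; the pipeline itself is made up):

```python
import os
import signal

from torchdata.datapipes.iter import IterableWrapper
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService

def die_at_50(x):
    # Simulate a worker process dying mid-iteration.
    if x == 50:
        os.kill(os.getpid(), signal.SIGKILL)
    return x

if __name__ == "__main__":
    pipe = IterableWrapper(range(100)).sharding_filter().map(die_at_50)
    dl = DataLoader2(pipe, reading_service=MultiProcessingReadingService(num_workers=2))
    for item in dl:  # per the report, this hangs instead of raising once the worker dies
        print(item)
```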
### 🚀 The feature

In MPI-based training, each process is independent of the others. Each training process might want to speed up dataloading using multiprocessing (MP). This requires data sharding...
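A sketch of what per-process dataloading could look like under that setup, assuming each MPI rank already holds its own slice of the data; `rank_local_samples` is a hypothetical stand-in for whatever the MPI side provides:

```python
from torchdata.datapipes.iter import IterableWrapper
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService

# Hypothetical stand-in: in real MPI training this would be derived from the
# communicator (rank / world size), e.g. a rank-local list of files or samples.
rank_local_samples = list(range(1000))

if __name__ == "__main__":
    # Within one MPI process: shard the rank-local data across MP workers.
    pipe = IterableWrapper(rank_local_samples).shuffle().sharding_filter()
    dl = DataLoader2(pipe, reading_service=MultiProcessingReadingService(num_workers=4))
    for item in dl:
        pass  # training step here
```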
### 🚀 The feature

1. Store usage statistics in `Prefetcher` - By tracking statistics within `Prefetcher`, we can reasonably determine whether upstream processes or downstream processes are faster. For example,...
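A rough sketch of the kind of bookkeeping this proposes; the class and counter names are hypothetical, not existing torchdata API:

```python
class PrefetchStats:
    """Hypothetical counters for a Prefetcher-like buffer."""

    def __init__(self):
        self.producer_full_waits = 0   # buffer was full: downstream is the bottleneck
        self.consumer_empty_waits = 0  # buffer was empty: upstream is the bottleneck

    def record_producer_wait(self):
        self.producer_full_waits += 1

    def record_consumer_wait(self):
        self.consumer_empty_waits += 1

    def bottleneck(self):
        # More empty-buffer waits means upstream can't keep up, and vice versa.
        if self.consumer_empty_waits > self.producer_full_waits:
            return "upstream"
        return "downstream"
```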
### 📚 The doc issue

The [ReadingService docs](https://pytorch.org/data/main/reading_service.html?highlight=replicable) describe the different sharding options and that one applies to replicable and one to non-replicable datapipes, but it's not really explained what...
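For illustration, the two sharding styles the docs contrast, as they appear in user code:

```python
from torch.utils.data.datapipes.iter.sharding import SHARDING_PRIORITIES
from torchdata.datapipes.iter import IterableWrapper

# Replicable: the pipeline is copied into every worker, and sharding_filter
# makes each replica keep only its own slice of the elements.
replicable = IterableWrapper(range(100)).shuffle().sharding_filter()

# Non-replicable: everything before sharding_round_robin_dispatch runs in a
# single dispatching process, which hands elements to workers round-robin.
non_replicable = IterableWrapper(range(100)).shuffle().sharding_round_robin_dispatch(
    SHARDING_PRIORITIES.MULTIPROCESSING
)
```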
### 🐛 Describe the bug

I'm trying to parse a single csv file that is zipped and stored in AWS S3, but I get the following error:

```
Exception when executing...
```
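A pipeline of roughly this shape is presumably what triggers the error; the bucket and key are hypothetical, and fsspec plus s3fs with valid AWS credentials are assumed:

```python
from torchdata.datapipes.iter import IterableWrapper

pipe = (
    IterableWrapper(["s3://my-bucket/data.zip"])
    .open_files_by_fsspec(mode="rb")  # open the remote file as a binary stream
    .load_from_zip()                  # yields (path, stream) per archive member
    .parse_csv()                      # parse each contained csv row by row
)
```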
### 🐛 Describe the bug

The following, in my opinion valid, snippet fails:

```python
import torchdata.datapipes as dp
from torch.utils.data.datapipes.iter.sharding import SHARDING_PRIORITIES
from torchdata.dataloader2 import MultiProcessingReadingService, DataLoader2

pipe = ...
```
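The body of `pipe = ...` is cut off above; one plausible completion, given the imports, is a round-robin-dispatched pipe driven by `DataLoader2` (an assumption for illustration, not the reporter's exact code):

```python
import torchdata.datapipes as dp
from torch.utils.data.datapipes.iter.sharding import SHARDING_PRIORITIES
from torchdata.dataloader2 import MultiProcessingReadingService, DataLoader2

if __name__ == "__main__":
    pipe = dp.iter.IterableWrapper(range(10)).sharding_round_robin_dispatch(
        SHARDING_PRIORITIES.MULTIPROCESSING
    )
    dl = DataLoader2(pipe, reading_service=MultiProcessingReadingService(num_workers=2))
    print(list(dl))
```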
### 🚀 The feature

https://github.com/pytorch/pytorch/blob/master/torch/utils/data/graph_settings.py#L51 currently explicitly checks for `_ShardingIterDataPipe`, which is 1. a private type, and 2. not in line with e.g. how `apply_shuffle_settings` works (checking for presence of...
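A sketch of the duck-typing alternative the issue suggests; the helper name is hypothetical:

```python
def supports_sharding(datapipe) -> bool:
    # Detect sharding support by method presence, mirroring how
    # apply_shuffle_settings detects shuffling, instead of isinstance-checking
    # the private _ShardingIterDataPipe type.
    return hasattr(datapipe, "apply_sharding") and callable(datapipe.apply_sharding)
```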
### 🐛 Describe the bug

When `cycle` is used with caching after it, it works the very first time, but afterwards it crashes in the `demux`, because it checks infinitely...
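The reporter's pipeline is cut off above; a hypothetical minimal reconstruction of the described combination might look like this:

```python
from torchdata.datapipes.iter import IterableWrapper

def odd_or_even(x):
    # Route elements to one of the two demux branches.
    return x % 2

# Cycle twice, cache the result, then split it with demux.
pipe = IterableWrapper(range(10)).cycle(2).in_memory_cache()
evens, odds = pipe.demux(num_instances=2, classifier_fn=odd_or_even)
print(list(evens))  # per the report: fine on the first pass, crashes on later ones
print(list(odds))
```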
### 🐛 Describe the bug

MPRS version:

```python
from torchdata.datapipes.iter import IterableWrapper
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService

# train_ds, batch_size, encode_processor, and num_workers come from the
# reporter's surrounding code, which is not shown here.
train_dp = IterableWrapper(train_ds).batch(batch_size=batch_size).collate(collate_fn=encode_processor)
rs = MultiProcessingReadingService(num_workers=num_workers)
train_dl = DataLoader2(train_dp, reading_service=rs)

for batch_idx, batch in enumerate(train_dl):
    print(batch_idx)
    if batch_idx > 500:
        break
```
...