data issues

S3FileLoader clears contents of local file when s3 object name == local file relative path name

2

### 🐛 Describe the bug When an S3 object name matches the relative path of a local file, the file's contents get cleared after loading the object data. ```python import...

ringohoffman

[WIP] added islice + flatten for iterable datapipes

Please read through our [contribution guide](https://github.com/pytorch/data/blob/main/CONTRIBUTING.md) prior to creating your pull request. - Note that there is a section on requirements related to adding a new DataPipe. Fixes #656 ###...

dbish

CLA Signed

[WIP] Benchmark Script

Stack from [ghstack](https://github.com/ezyang/ghstack): * __->__ #734 This PR is primarily focused on adding more datasets for benchmarking. Notable changes that are in progress: - Using `PrototypeMultiprocessingReadingService` as that will become...

NivekT

CLA Signed

Recommended way to shuffle intra and inter archives?

8

Say I have a bunch of archives containing samples. In my case each archive is a pickle file containing a list of samples, but it could be a tar or...

NicolasHug
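
A minimal sketch of one way to read the question above, using plain Python rather than any torchdata API: shuffle the order of the archives first, then shuffle the samples inside each archive as it is loaded. The function name `shuffle_inter_and_intra`, the `seed` parameter, and the in-memory lists standing in for pickle or tar archives are all assumptions for illustration.

```python
import random
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def shuffle_inter_and_intra(
    archives: Sequence[Sequence[T]], seed: int = 0
) -> Iterator[T]:
    """Yield every sample once, shuffling archive order and order within each archive."""
    rng = random.Random(seed)

    # Inter-archive shuffle: randomize which archive is visited next.
    order = list(range(len(archives)))
    rng.shuffle(order)

    for idx in order:
        # Stand-in for loading one archive (e.g. unpickling a list of samples).
        samples = list(archives[idx])
        # Intra-archive shuffle: randomize samples inside the loaded archive.
        rng.shuffle(samples)
        yield from samples
```

Only one archive is held in memory at a time, but samples from different archives are never interleaved; mixing across archives would need an additional shuffle buffer.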

Chainer/Concater from single datapipe?

7

The `Concater` datapipe takes multiple DPs as input. Is there a class that would take a **single** datapipe of iterables instead? Something like this: ```py class ConcaterIterable(IterDataPipe): def __init__(self, source_datapipe):...

NicolasHug

good first issue
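
The behavior asked for above (chaining the iterables produced by a single source) can be sketched in plain Python with `itertools.chain.from_iterable`. This is an illustration of the idea only, not the `ConcaterIterable` class from the issue; the function name `concat_from_single` is hypothetical.

```python
from itertools import chain
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def concat_from_single(source: Iterable[Iterable[T]]) -> Iterator[T]:
    """Lazily yield the items of each inner iterable, in order."""
    # chain.from_iterable consumes `source` lazily, one inner iterable at a time.
    return chain.from_iterable(source)
```

For example, `list(concat_from_single([[1, 2], (3,), []]))` gives `[1, 2, 3]`.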

[DataPipe] Add RandomSplitter (without buffer)

8

Stack from [ghstack](https://github.com/ezyang/ghstack): * __->__ #724 This PR adds RandomSplitter without a buffer. The upside is that this uses less memory (good for memory-bound cases) but the downsides are 1)...

NivekT

CLA Signed

topic: new feature

Implement DistributedReadingService

Per title

ejguan

CLA Signed

[DataLoader] Add len to DataLoader2

5

Stack from [ghstack](https://github.com/ezyang/ghstack): * __->__ #728 * #746 Adding `__len__` to `DataLoader2`. See inline comments. We should discuss the details and if this makes sense. Fixes #549 Differential Revision: [D38999743](https://our.internmc.facebook.com/intern/diff/D38999743)

NivekT

CLA Signed

topic: improvements

Ability to manipulate columns and fields

8

### 🚀 The feature - [x] Add ability to drop specific column / field For `list` and `tuple` ```python list(dp) # [ (0, 1, 2), (3, 4, 5), (6,7, 8)...

VitalyFedyunin

good first issue
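
As a rough illustration of the "drop a specific column / field" request for `list` and `tuple` rows, here is a plain-Python sketch. The helper name `drop_field` and its `index` parameter are assumptions, not an existing torchdata API.

```python
from typing import Iterable, Iterator, Sequence, Union

Row = Union[list, tuple]


def drop_field(rows: Iterable[Row], index: int) -> Iterator[Row]:
    """Yield each row without the element at position `index`."""
    for row in rows:
        # Rebuild the row with the same container type (list or tuple).
        # Note: this does not handle namedtuples, whose constructors differ.
        yield type(row)(v for i, v in enumerate(row) if i != index)
```

With the rows shown in the issue, `list(drop_field([(0, 1, 2), (3, 4, 5), (6, 7, 8)], 1))` gives `[(0, 2), (3, 5), (6, 8)]`.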

Add Examples of Common Preprocessing Steps with IterDataPipe (such as splitting a data set into two)

7

### 📚 The doc issue There are a few common steps that users often would like to do while preprocessing data, such as [splitting their data set](https://pytorch.org/docs/stable/data.html#torch.utils.data.random_split) into train and...

NivekT

documentation
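
One possible shape for such a splitting example, sketched in plain Python instead of with `IterDataPipe`: each split re-iterates the source with the same seed and keeps only the items assigned to it, so no buffer is needed. The names `split_stream`, `train_fraction`, and `keep_train` are hypothetical, and the sketch assumes the source can be iterated more than once in the same order.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def split_stream(
    source: Iterable[T],
    train_fraction: float = 0.8,
    seed: int = 0,
    keep_train: bool = True,
) -> Iterator[T]:
    """Yield the items assigned to one side of a seeded random split."""
    # The same seed gives the same sequence of draws on every pass, so the
    # two sides of the split are disjoint and together cover the source.
    rng = random.Random(seed)
    for item in source:
        is_train = rng.random() < train_fraction
        if is_train == keep_train:
            yield item
```

Calling it twice over the same source, once with `keep_train=True` and once with `keep_train=False`, yields two non-overlapping streams whose sizes are only approximately in the requested ratio.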
