
How can I disable automatic data distribution when using StreamingDataset?

Open · ygtxr1997 opened this issue 1 year ago • 4 comments

🚀 Feature

Provide an option to disable the automatic distributed data sampler when using StreamingDataset.

Motivation

When I use StreamingDataset in a DDP environment, the dataset length of StreamingDataset always seems to be original_len / world_size. However, I want the different processes (with different local ranks) to share exactly the same StreamingDataset, without any data splitting.
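
For illustration, a minimal sketch of the behavior described above (the input path is hypothetical):

```python
# Hypothetical illustration of the observed behavior under DDP.
# With world_size = 4, each rank reports len(dataset) == original_len / 4.
from litdata import StreamingDataset

dataset = StreamingDataset(input_dir="s3://my-bucket/optimized")  # hypothetical path
print(len(dataset))  # on each rank: original_len / world_size, not original_len
```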

Pitch

How can I stop the automatic data distribution when using StreamingDataset in DDP? Could you provide a setting for this, or explain why the distribution cannot be disabled?
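
For illustration only, the kind of setting this pitch asks for; the `distribute` flag below is hypothetical and does not exist in litdata:

```python
from litdata import StreamingDataset

# Hypothetical API sketch of the requested setting. The `distribute`
# argument does NOT exist in litdata; it only illustrates the proposal.
dataset = StreamingDataset(
    input_dir="s3://my-bucket/optimized",  # hypothetical path
    # distribute=False,  # proposed: every rank iterates the full dataset
)
```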

Alternatives

Additional context

ygtxr1997 avatar Sep 12 '24 15:09 ygtxr1997

Hi! Thanks for your contribution, great first issue!

github-actions[bot] avatar Sep 12 '24 15:09 github-actions[bot]

Hey @ygtxr1997. You can override the distributed env on the dataset. It is inferred automatically from torch.

What is your use case?
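
A minimal sketch of what that override could look like. `_DistributedEnv` and the `distributed_env` attribute are litdata internals, so the exact names and signatures may differ between versions:

```python
from litdata import StreamingDataset
from litdata.utilities.env import _DistributedEnv  # internal helper; may change

dataset = StreamingDataset(input_dir="s3://my-bucket/optimized")  # hypothetical path

# Pretend to be a single non-distributed process so no splitting happens.
dataset.distributed_env = _DistributedEnv(world_size=1, global_rank=0, num_nodes=1)
print(len(dataset))  # now the full dataset length on every rank
```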

tchaton avatar Sep 14 '24 13:09 tchaton

> Hey @ygtxr1997. You can override the distributed env on the dataset. It is inferred automatically from torch.
>
> What is your use case?

I think overriding the litdata class could be a way to solve my issue above.

Initially, I wanted to use litdata to optimize my dataset, which consists of ~500k small files (each about 200 KB), all stored on a remote storage server. However, unlike typical image datasets, my dataloader needs to read from two distinct files per sample, and the index gap between the two files varies within [20, 50], like a sliding window. For instance, the file indices in a data batch (batch_size=4) could look like this:

batch: [100,120], [1000,1030], [1020,1070], [500,550]
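
For concreteness, a minimal sketch of this access pattern; `read_file` is a placeholder for the actual remote read, not a litdata API:

```python
import random
from torch.utils.data import Dataset

def read_file(index: int):
    """Placeholder for reading one of the ~500k remote files by index."""
    return index  # stand-in payload

class SlidingPairDataset(Dataset):
    """Each sample reads two distinct files whose index gap lies in [20, 50]."""

    def __init__(self, num_files: int = 500_000):
        self.num_files = num_files

    def __len__(self):
        # Reserve headroom so idx + 50 never goes out of range.
        return self.num_files - 50

    def __getitem__(self, idx):
        gap = random.randint(20, 50)
        return read_file(idx), read_file(idx + gap)
```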

According to your usage example and the distributed data loading illustration GIF, litdata does not seem well suited to such a random-read-like case, am I right? Perhaps the performance depends on how the original files are merged into litdata chunks. Keeping the original file order (from small index to large index) might yield faster loading, since the two files of a sample would often land in the same chunk, but could this hurt the training of deep models?
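
If it helps, a sketch of how one might optimize the files in index order (based on litdata's `optimize` API from its README; exact parameters may differ):

```python
from litdata import optimize

def load_file(index: int):
    """Placeholder: fetch and return the raw bytes of file `index`."""
    return b"..."  # stand-in payload

if __name__ == "__main__":
    optimize(
        fn=load_file,
        inputs=list(range(500_000)),  # keep small-to-large index order
        output_dir="my_optimized_dataset",  # hypothetical path
        chunk_bytes="64MB",  # neighboring indices tend to share a chunk
    )
```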

Therefore, I am not sure whether litdata can help boost the data loading speed in my case.

ygtxr1997 avatar Sep 16 '24 12:09 ygtxr1997

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '25 05:04 stale[bot]