
Add batch sampler for StreamingDataLoader to enable flexible training strategies.

Open xinsir6 opened this issue 9 months ago • 2 comments

Hi, I just encountered a problem with litdata. I rewrote my dataset and dataloader to use StreamingDataset and StreamingDataLoader, but I found that StreamingDataLoader does not accept a sampler, which breaks my original training strategy.

I use bucket training, which puts images with the same width/height ratio into the same batch, so I need a custom batch sampler and a way to pass it to the dataloader. Since StreamingDataLoader doesn't support this, I can't implement the bucket strategy, and my training becomes sub-optimal.

For example, suppose my data sequence is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. With a batch size of 4, I can guarantee that (0 1 2 3), (4 5 6 7), (8 9 10 11), and (12 13 14 15) each share the same bucket size, so single-GPU training runs fine. But when I switch to multi-GPU training, StreamingDataLoader shuffles the sequence, so a batch may contain images from different buckets, and the subsequent collate function raises an error.

Could you please add sampler support to StreamingDataLoader, or suggest another way to solve this problem?
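The bucketing idea described above can be sketched as a standalone batch sampler. This is a minimal plain-Python sketch, not litdata or PyTorch API; the `BucketBatchSampler` name and the bucket-key scheme are hypothetical (a real PyTorch version would subclass `torch.utils.data.Sampler`):

```python
from collections import defaultdict

class BucketBatchSampler:
    """Yield batches of indices whose samples share the same bucket key
    (e.g. the same rounded width/height aspect ratio). Hypothetical sketch."""

    def __init__(self, bucket_keys, batch_size):
        # bucket_keys[i] is the bucket id (e.g. rounded aspect ratio) of sample i
        self.bucket_keys = bucket_keys
        self.batch_size = batch_size

    def __iter__(self):
        buckets = defaultdict(list)
        for idx, key in enumerate(self.bucket_keys):
            buckets[key].append(idx)
            if len(buckets[key]) == self.batch_size:
                # emit a full batch drawn entirely from one bucket
                yield buckets.pop(key)
        # leftover partial batches are still bucket-pure
        for batch in buckets.values():
            yield batch

# Example matching the sequence above: 16 samples, bucket id = index // 4
keys = [i // 4 for i in range(16)]
batches = list(BucketBatchSampler(keys, batch_size=4))
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
```

With a vanilla PyTorch DataLoader this object would be passed as `batch_sampler=...`; the point of the issue is that StreamingDataLoader offers no such hook.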

xinsir6 avatar Mar 11 '25 16:03 xinsir6

Hi! Thanks for your contribution, great first issue!

github-actions[bot] avatar Mar 11 '25 16:03 github-actions[bot]

Hi @xinsir6, thanks for the detailed issue! Currently, StreamingDataLoader doesn't support custom batch samplers. As a workaround, you can optimize your data into separate datasets based on the sizes and combine them with CombinedStreamingDataset(..., batching_method="per_stream"), so that every batch is drawn from a single stream. Hope this helps!
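A minimal sketch of this workaround, assuming each aspect-ratio bucket has already been written to its own optimized dataset directory with litdata's optimize step (the directory paths and batch size below are placeholders, and this cannot run without that prepared data on disk):

```python
from litdata import StreamingDataset, CombinedStreamingDataset, StreamingDataLoader

# One optimized dataset per aspect-ratio bucket (placeholder paths).
bucket_dirs = ["data/bucket_1x1", "data/bucket_4x3", "data/bucket_16x9"]
datasets = [StreamingDataset(input_dir=d) for d in bucket_dirs]

# batching_method="per_stream" draws each batch from a single underlying
# dataset, so every batch contains images from only one bucket.
combined = CombinedStreamingDataset(datasets=datasets, batching_method="per_stream")

loader = StreamingDataLoader(combined, batch_size=4, num_workers=2)
for batch in loader:
    ...  # all samples in this batch share one bucket, so collate succeeds
```

The trade-off is that bucket membership is fixed at optimization time rather than chosen per-epoch by a sampler, but it keeps batches bucket-pure even under multi-GPU shuffling.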

bhimrazy avatar Jun 03 '25 10:06 bhimrazy