Efficiently grouping large datasets
Hello everyone,
I have a use case where I need to train OpenCLIP on a very large dataset (e.g., LAION-400M or larger).
Suppose I download this dataset in the usual way, so each sample contains an image and a caption. Now let's say I also have separate webdatasets with additional data for each image, for example another description and other metadata. How can I safely and efficiently merge these in Python? I don't want to write a new set of tar files; I just want to merge them in Python so I can iterate over the merged dataset, as sketched below.
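To make the goal concrete, here is roughly what one merged sample should look like (the extra field names like `extra.txt` and `info.json` are just placeholders I made up):

```python
# Field names in the extra dataset are placeholders, not the real ones.
main_sample  = {"__key__": "00000001", "jpg": b"<image bytes>", "txt": b"a caption"}
extra_sample = {"__key__": "00000001", "extra.txt": b"another description",
                "info.json": b'{"other": "information"}'}

# What I want to iterate over in Python, without writing new tar files:
merged_sample = {**main_sample, **{k: v for k, v in extra_sample.items()
                                   if not k.startswith("__")}}
```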
I looked at the column store example, but the solution with the add_column function assumes the files align perfectly and that the iterator runs on a single node. That is, the .compose needs to happen first, and only then can I split by node and worker. I'm not sure whether I should use wids, group_by_keys, or something else. Maybe the webdataset with the additional data should be stored in another format instead? Another option would be to pack everything into a single webdataset.
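For concreteness, this is roughly what I have in mind: a minimal sketch that zips two WebDataset pipelines and merges samples by `__key__`. The shard patterns and the `merged_samples` name are made up, and it assumes the extra shards were written in exactly the same order as the main ones:

```python
import webdataset as wds

# Made-up shard patterns; in reality these point at the LAION-style tars.
main_shards  = "data/main-{000000..000099}.tar"   # image + caption
extra_shards = "data/extra-{000000..000099}.tar"  # extra description, metadata

def merged_samples():
    # shardshuffle=False so both pipelines see shards in the same order
    main  = wds.WebDataset(main_shards, shardshuffle=False)
    extra = wds.WebDataset(extra_shards, shardshuffle=False)
    for a, b in zip(main, extra):
        # This only works if the two datasets align perfectly sample-for-sample.
        assert a["__key__"] == b["__key__"], (a["__key__"], b["__key__"])
        merged = dict(a)
        for k, v in b.items():
            if not k.startswith("__"):   # skip __key__, __url__, etc.
                merged[k] = v
        yield merged  # values are still raw bytes; decoding can happen later

# usage: for sample in merged_samples(): ...
```

But once shuffling or splitting by node and worker enters the picture, I don't see how to keep the two pipelines in sync, which is really the core of my question.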
Would really appreciate the help.
Thanks!