[DataPipe] shard expander

Open tmbdev opened this issue 3 years ago • 6 comments

This PR adds ShardExpander, a filter that takes shard specs of the form "prefix-{000..999}.tar" and expands them into the 1000 corresponding file names, like shell brace expansion. It implements only numerical range expansion, not full shell-style brace expansion.
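For readers unfamiliar with the notation, here is a minimal sketch of numerical range expansion in plain Python. It is not the code from this PR: the function name `expand_shard_spec` is hypothetical, and the padding rule (zero-pad to the width of the lower bound) and the handling of descending ranges (they yield nothing) are assumptions made for illustration.

```python
import re
from typing import Iterator

# Matches a single numerical range such as {000..999}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_shard_spec(spec: str) -> Iterator[str]:
    """Yield every name described by a spec like "prefix-{000..999}.tar".

    Hypothetical helper for illustration only; not the ShardExpander API.
    """
    match = _RANGE.search(spec)
    if match is None:
        # No range left to expand: the spec is already a plain name.
        yield spec
        return
    lo, hi = match.group(1), match.group(2)
    width = len(lo)  # assumption: pad to the width of the lower bound
    prefix, suffix = spec[: match.start()], spec[match.end():]
    for i in range(int(lo), int(hi) + 1):
        # Recurse so that any further ranges in the suffix are expanded too.
        yield from expand_shard_spec(f"{prefix}{i:0{width}d}{suffix}")


shards = list(expand_shard_spec("prefix-{000..999}.tar"))
small = list(expand_shard_spec("prefix-{000..009}.tar"))
```

Here `shards` holds the full set of 1000 names, while `small` selects a 10-shard subset just by narrowing the range, without listing anything on the storage system.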

Specifying collections of files as numerical shard specs (instead of using classes like FileLister) has several advantages: (1) it does not require the ability to list files on the target, which is not possible in general (for example, over HTTP); (2) it documents the collection of files and makes its specification reproducible and independent of the state of the storage system; (3) it makes it easy to choose different dataset sizes by specifying different numerical subranges.

ShardExpander is frequently used with WebDataset to specify collections of training shards.

tmbdev avatar May 13 '22 20:05 tmbdev

@tmbdev Do you still plan to work on this (and related) PRs?

tadejsv avatar Aug 20 '22 16:08 tadejsv

Yes, I'm planning on working on the handful of PRs we have been discussing and addressing the issues raised.

tmbdev avatar Aug 20 '22 16:08 tmbdev

I've updated the PR and I believe I have addressed all the comments.

tmbdev avatar Aug 31 '22 19:08 tmbdev

@VitalyFedyunin Mind taking a look at this PR?

tadejsv avatar Sep 12 '22 20:09 tadejsv

We'll have a look tomorrow or Wednesday. Thanks!

NivekT avatar Sep 13 '22 01:09 NivekT

LGTM with @NivekT's suggestions applied.

VitalyFedyunin avatar Sep 21 '22 16:09 VitalyFedyunin

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Oct 26 '22 13:10 facebook-github-bot