Kevin Tse
@nivekt has imported this pull request. If you are a Facebook employee, you can view this diff [on Phabricator](https://www.internalfb.com/diff/D38675266).
> What is a good canonical way to shuffle intra and inter archives?

I think the best way is to use [`in_batch_shuffle`](https://github.com/pytorch/data/blob/main/torchdata/datapipes/iter/transform/bucketbatcher.py#L19-L47). Though we need to add further randomness control...
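To make the intra/inter distinction concrete, here is a rough plain-Python sketch (not the torchdata implementation). It treats each archive as a list of samples, shuffles the order of archives, then shuffles within each one. The function names and the single-seed scheme are assumptions for illustration only; how randomness control should actually work is still the open question above.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle_within_batches(batches: Iterable[List[T]], seed: int) -> Iterator[List[T]]:
    """Intra shuffle: reorder the samples inside each batch (hypothetical helper)."""
    rng = random.Random(seed)
    for batch in batches:
        shuffled = list(batch)  # copy so the caller's batch is left untouched
        rng.shuffle(shuffled)
        yield shuffled


def shuffle_archives(archives: Iterable[List[T]], seed: int) -> Iterator[List[T]]:
    """Inter shuffle followed by intra shuffle, both derived from one seed."""
    rng = random.Random(seed)
    order = list(archives)
    rng.shuffle(order)  # inter: reorder the archives themselves
    # intra: reorder samples inside each archive with a derived seed
    yield from shuffle_within_batches(order, seed=rng.randrange(2**32))


archives = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
first_run = list(shuffle_archives(archives, seed=0))
second_run = list(shuffle_archives(archives, seed=0))
```

With the same seed both runs produce the same order, which is the kind of control I would want `in_batch_shuffle` to expose.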
I share @ejguan's concern about Option 1, since we need the linter to work at runtime. I personally prefer Option 2, especially the Global-based rather than...
Edit: I think the new operation [`flatten` in this open PR](https://github.com/pytorch/data/blob/da5e6493c8cc2f040f6a54487f228052dabb131e/torchdata/datapipes/iter/transform/callable.py#L289) should be able to handle an IterDataPipe of iterables, depending on how we end up implementing that.
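For reference, the behavior I have in mind is roughly the following. This is a plain-Python sketch using `itertools.chain`, not the code in the PR; `flatten_one_level` is a hypothetical name, and the real `flatten` may end up with different semantics (for example, selecting which elements to flatten).

```python
from itertools import chain
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def flatten_one_level(source: Iterable[Iterable[T]]) -> Iterator[T]:
    """Lazily yield every item of every inner iterable, in order (one level only)."""
    return chain.from_iterable(source)


nested = [[1, 2], (3,), iter([4, 5])]  # inner iterables can be of mixed types
flat = list(flatten_one_level(nested))
```

Because it is lazy, the inner iterables are only consumed as the output is iterated.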
> @VitalyFedyunin Mind taking a look at this PR?

We'll have a look tomorrow or Wednesday. Thanks!
This can be closed, as #837 has landed.
I am going to take a quick look into `fsspec` vs `s3` performance in my benchmark
My benchmark shows that using `FSSpecFileOpener` is faster and it also provides the ability to stream (rather than downloading a whole archive into memory before reading).
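To illustrate what I mean by streaming, here is a small stdlib-only sketch (it does not use torchdata or `fsspec`). It reads a tar archive member by member from a file-like object via `tarfile`'s stream mode (`"r|"`), so the reader never needs to seek or hold the whole archive. The in-memory `BytesIO` stands in for a remote stream, and both helper names are hypothetical.

```python
import io
import tarfile
from typing import BinaryIO, Dict, Iterator, Tuple


def build_demo_archive(files: Dict[str, bytes]) -> bytes:
    """Create a small tar archive in memory so the example is self-contained."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def stream_members(fileobj: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, content) pairs one at a time from a non-seekable-style stream."""
    # "r|" tells tarfile to read sequentially without seeking.
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            if member.isfile():
                extracted = tar.extractfile(member)
                if extracted is not None:
                    yield member.name, extracted.read()


demo_stream = io.BytesIO(build_demo_archive({"a.txt": b"hello", "b.txt": b"world"}))
contents = dict(stream_members(demo_stream))
```

In stream mode each member has to be read before moving on to the next one, which is the access pattern a sequential archive reader follows anyway.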
Since #812 and #836 have landed, I believe users should be able to use GCP and Azure Blob storage. Please feel free to re-open this issue or open a new...