Please read through our contribution guide prior to creating your pull request.
- Note that there is a section on requirements related to adding a new DataPipe.
Fixes #416
Changes
- Added a CLI to run a set of benchmarks across datasets and measure key metrics
May 24 notes
Notes from discussion with Vitaly
- The lib-kineto profiler won't split iterator performance by datapipe, so it will show up as one big block
- Load image, transform, batch and collate (collate will look like the biggest one) - measurements will be cumulative
- Rotation, collation, etc. all need to happen before passing into the data loader (all of these need to be datapipe maps) https://pytorch.org/data/main/torchdata.datapipes.map.html
- Batch counting needs to happen before we receive a batch from data loader https://github.com/pytorch/data/blob/411167bbca3b800b3f54d37674e8751e40a80e29/benchmarks/run_benchmark.py#L110
- Check if the autograd profiler works https://github.com/pytorch/pytorch/blob/master/torch/utils/data/datapipes/_typing.py#L462 - does this work in multiprocessing? `fork` will make everything disappear
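The ordering constraint above (transforms applied as maps before the loader, batch counting wrapped around the iterator the loader consumes) can be sketched in plain Python. `load_image`, `rotate`, and `counting` are illustrative stand-ins, not functions from the benchmark code:

```python
def load_image(path):
    # Stand-in for decoding an image file into pixel data.
    return {"path": path, "pixels": [0] * 4}

def rotate(sample):
    # Stand-in for a per-sample transform; must run *before* the loader.
    sample["pixels"] = list(reversed(sample["pixels"]))
    return sample

def counting(iterable, counter):
    # Count batches on the iterator side, before any downstream consumer.
    for item in iterable:
        counter["batches"] += 1
        yield item

paths = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]
pipe = map(rotate, map(load_image, paths))          # maps happen pre-loader
batched = (list(pair) for pair in zip(pipe, pipe))  # toy batching, size 2
counter = {"batches": 0}
for batch in counting(batched, counter):
    pass
print(counter["batches"])  # 2
```

With real datapipes the same shape applies: chain `.map(...)` calls on the pipe, then hand the result to the loader.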
May 31 notes
- PyTorch profiler needs to report fewer things - opened an issue on the kineto repo https://github.com/pytorch/kineto/issues/609
- Modularize code a bit better so we can create baseline for datapipe/dataset vs dataloaderv1/v2 - make train.py take a generic iterator
- Check if large call stack is because of shuffling done by vision
- Do scaling after this is done
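The "generic iterator" idea above can be sketched as a toy loop that assumes nothing about its input beyond iterability, so the same code can be timed against a datapipe, a DataLoader v1, or a DataLoader v2. `train` and `step` here are hypothetical names, not the actual train.py interface:

```python
import time

def train(batches, step):
    """Toy training loop: `batches` only needs to be iterable."""
    t0 = time.perf_counter()
    n = 0
    for batch in batches:
        step(batch)  # model forward/backward would go here
        n += 1
    return n, time.perf_counter() - t0

# Any iterable works as a drop-in: a list, a DataLoader, or a DataPipe.
seen = []
n, elapsed = train([[1, 2], [3, 4]], seen.append)
print(n)     # 2
print(seen)  # [[1, 2], [3, 4]]
```

Keeping the loop source-agnostic makes the datapipe vs dataloaderv1/v2 baseline a matter of swapping the first argument.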
OK, we can now get a trace and have an end-to-end example working. Data loading is not the bottleneck here yet, so I will keep experimenting.
Next thing I'd like to try out is including these datasets: `from torchtext.datasets import AmazonReviewFull`
after which we can start warming up GPUs
The call graph for datapipe construction is long, but data loading isn't the bottleneck here since utilization is just 7%. Need to fix the collation problems and bump up the batch size.
Also I finally figured out how to build torchtext from source so can use those datapipes as well https://github.com/pytorch/text/issues/1743

Discussion with Vitaly June 21
- Will focus on running on mc4 with dataloader v1 on various hardware configurations (SSD, HDD) and use a few starter cloudformation templates to make this easier
Overall, LGTM with a few comments. Let me know what additional features you plan to add.
nit: Need copyright headers for `.py` files
Thanks @NivekT, will address all your feedback. As far as new features to add for this PR, not much I think; there's a bunch of cleanup I need to do:
- Clean up the report into its own dataclass which you can then export to whatever format you want: html, md, csv, etc.
- Address all your feedback
- Some more cleanup
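As a rough sketch of the report-dataclass idea (field names and export formats are illustrative, not the final design):

```python
from dataclasses import dataclass, fields

@dataclass
class BenchmarkReport:
    # Hypothetical fields; the real report would carry more metrics.
    dataset: str
    loader: str
    batches_per_s: float

    def to_csv(self):
        names = [f.name for f in fields(self)]
        vals = [str(getattr(self, n)) for n in names]
        return ",".join(names) + "\n" + ",".join(vals)

    def to_md(self):
        names = [f.name for f in fields(self)]
        vals = [str(getattr(self, n)) for n in names]
        return ("| " + " | ".join(names) + " |\n"
                + "|" + "---|" * len(names) + "\n"
                + "| " + " | ".join(vals) + " |")

report = BenchmarkReport("mc4", "dataloader_v1", 123.4)
print(report.to_csv().splitlines()[0])  # dataset,loader,batches_per_s
```

One dataclass holding the measurements, with each export format as a small method, keeps the benchmark runner decoupled from how results are rendered.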
And I think the next PR should be focused on integrating the AWS CLI into CI so we can benchmark a distributed systems setup, per @NicolasHug's request
And after that we can see which of the partner integrations should be added to this setup as well
@msaroufim @VitalyFedyunin @NivekT following up on my earlier comments in https://github.com/pytorch/data/issues/416#issuecomment-1164404834 I also have a separate PR (https://github.com/pytorch/vision/pull/6196) that already provides support for the cross-product of:
- Distributed Learning (DDP) vs 1-GPU training
- Datapipes (with DataLoader or `torchdata.dataloader2`) vs Iterable datasets (non-DP) vs MapStyle Datasets
- Full training procedure or Data-loading only (with or without transforms) or Model training only (generating fake datasets)
- Timing of data-loading vs model training
- any classification model from torchvision
(It also has FFCV support, but that's less relevant for us here).
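One way the cross-product above could surface as CLI flags; the flag names are hypothetical, not the ones in the linked PR:

```python
import argparse

# Illustrative CLI mirroring the benchmark cross-product described above.
parser = argparse.ArgumentParser(description="toy benchmark matrix")
parser.add_argument("--dataset-kind",
                    choices=["datapipe", "iterable", "mapstyle"],
                    default="mapstyle")
parser.add_argument("--mode",
                    choices=["full", "data-only", "model-only"],
                    default="full")
parser.add_argument("--distributed", action="store_true",
                    help="DDP instead of 1-GPU training")

# Parse a sample invocation instead of sys.argv for demonstration.
args = parser.parse_args(["--dataset-kind", "datapipe", "--mode", "data-only"])
print(args.dataset_kind, args.mode, args.distributed)  # datapipe data-only False
```

Each axis of the cross-product becomes one flag, so a sweep script can iterate over the full matrix.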
Since it's directly adapted from torchvision recipes, it's also a bit closer to the kind of training that users would be doing in the wild.
Do you think it would make sense to join our benchmarking efforts here? I'm happy to provide support if you'd like to collaborate.
CC @nairbv
@NicolasHug I am in the process of going through both setups, running them on our AWS cluster, and identifying the differences. I agree that combining the efforts is the right approach. Let me dig a bit deeper first and I can schedule a meeting for all of us to chat.
@NicolasHug I think the right way to divide this up would be
- I work on the infra setup, the benchmark artifact and the benchmark export
- I leverage your model training scripts since you're the domain expert
I would also eventually like to do something like pulling any of the HF datasets and benchmarking there, but I don't believe the datasets there give me enough information to automatically create a toy model with the right shapes
But yeah would love to talk