
data/benchmarks

msaroufim opened this pull request • 10 comments

Please read through our contribution guide prior to creating your pull request.

  • Note that there is a section on requirements related to adding a new DataPipe.

Fixes #416

Changes

  • Added CLI to run various benchmarks on various datasets and measure key metrics

msaroufim commented on May 19, 2022

May 24 notes

Notes from discussion with Vitaly

  • The libkineto profiler won't split iterator performance by datapipe; it will show up as one big block
  • Load image, transform, batch, and collate (collate will look like the biggest one) - measurements will be cumulative
  • Rotation, collation, etc. all need to happen before passing data into the DataLoader, i.e. each needs to be a datapipe map (see the sketch after this list) https://pytorch.org/data/main/torchdata.datapipes.map.html
  • Batch counting needs to happen before we receive a batch from the DataLoader https://github.com/pytorch/data/blob/411167bbca3b800b3f54d37674e8751e40a80e29/benchmarks/run_benchmark.py#L110
  • Check whether the autograd profiler works https://github.com/pytorch/pytorch/blob/master/torch/utils/data/datapipes/_typing.py#L462 - does it work in multiprocessing? With fork, everything will disappear
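
A minimal sketch of the layout described above - loading, rotation, batching, and collation all expressed as datapipe operations ahead of the DataLoader, with batches counted straight off the pipe. The file paths, `load_image` helper, and transform choices are illustrative, not the actual benchmark code:

```python
from PIL import Image
import torchvision.transforms as T
from torchdata.datapipes.iter import IterableWrapper


def load_image(path):
    # Illustrative decoder; the real benchmark would use its dataset's decoder
    return Image.open(path).convert("RGB")


transform = T.Compose([T.RandomRotation(10), T.Resize((224, 224)), T.ToTensor()])


def build_datapipe(image_paths, batch_size=32):
    dp = IterableWrapper(image_paths)
    dp = dp.map(load_image)              # load
    dp = dp.map(transform)               # rotate / resize / to-tensor as datapipe maps
    dp = dp.batch(batch_size).collate()  # batch + collate inside the pipe
    return dp


# Count batches as they come off the pipe, before any DataLoader wrapping
num_batches = sum(1 for _ in build_datapipe(["img_0.jpg", "img_1.jpg"]))
```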

May 31 notes

  • The PyTorch profiler needs to report fewer things - opened an issue on the Kineto repo https://github.com/pytorch/kineto/issues/609
  • Modularize the code a bit better so we can create baselines for datapipe/dataset vs DataLoader v1/v2 - make train.py take a generic iterator (see the sketch after this list)
  • Check whether the large call stack is because of shuffling done by torchvision
  • Do scaling after this is done
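
A rough sketch of what "train.py takes a generic iterator" could look like - the loop only assumes an iterable of (inputs, targets) batches, so the same entry point can be fed a raw datapipe, DataLoader v1, or DataLoader2 (the name and signature here are illustrative):

```python
from typing import Iterable, Tuple

import torch


def train_one_epoch(
    model: torch.nn.Module,
    batches: Iterable[Tuple[torch.Tensor, torch.Tensor]],
    optimizer: torch.optim.Optimizer,
) -> float:
    loss_fn = torch.nn.CrossEntropyLoss()
    total_loss = 0.0
    for inputs, targets in batches:  # any iterator of batches works here
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss
```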

msaroufim commented on May 24, 2022

OK, we can now get a trace and have an end-to-end example working. Data loading is not the bottleneck here yet, so I'll keep experimenting.


msaroufim commented on May 25, 2022

[screenshot from May 25, 2022]

msaroufim commented on May 26, 2022

The next thing I'd like to try is pulling in the torchtext datasets (e.g. `from torchtext.datasets import AmazonReviewFull`), after which we can start warming up GPUs.
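
A rough sketch of what that experiment might look like, assuming torchtext exposes `AmazonReviewFull` as an iterable datapipe of (label, text) pairs; the helper name and the sample limit are just for illustration:

```python
from torchtext.datasets import AmazonReviewFull


def count_samples(split="train", limit=10_000):
    # The dataset is a datapipe yielding (label, text) pairs
    dp = AmazonReviewFull(split=split)
    count = 0
    for label, text in dp:
        count += 1
        if count >= limit:
            break
    return count


print(count_samples())
```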

msaroufim commented on May 26, 2022

The call graph for datapipe construction is long, but data loading is not the bottleneck here since utilization is just 7%. I need to fix the collation problems and bump up the batch size.

Also, I finally figured out how to build torchtext from source, so I can use those datapipes as well: https://github.com/pytorch/text/issues/1743

[profiler screenshots from May 26, 2022]

msaroufim commented on May 27, 2022

Discussion with Vitaly June 21

  • Will focus on running on mc4 with DataLoader v1 on various hardware configurations (SSD, HDD) and use a few starter CloudFormation templates to make this easier

msaroufim commented on Jun 21, 2022

Overall, LGTM with a few comments. Let me know what additional features you plan to add.

nit: Need copyright headers for .py files

Thanks @NivekT, I will address all your feedback. As far as new features to add for this PR, not much I think - there's a bunch of cleanup I need to do:

  • Clean up the report into its own dataclass, which can then be exported to whatever format you want: HTML, Markdown, CSV, etc. (see the sketch after this list)
  • Address all your feedback
  • Some more cleanup
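
A rough sketch of that report-dataclass idea - the field names and export formats here are illustrative, not a final schema:

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class BenchmarkReport:
    dataset: str
    dataloader_version: str
    batch_size: int
    total_time_s: float
    samples_per_s: float

    def to_markdown(self) -> str:
        names = [f.name for f in fields(self)]
        values = [str(getattr(self, name)) for name in names]
        return (
            "| " + " | ".join(names) + " |\n"
            "| " + " | ".join("---" for _ in names) + " |\n"
            "| " + " | ".join(values) + " |"
        )

    def to_csv(self) -> str:
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(self)])
        writer.writeheader()
        writer.writerow(asdict(self))
        return buf.getvalue()


# Example usage with made-up numbers
report = BenchmarkReport("mc4", "v1", 32, 120.5, 850.0)
print(report.to_markdown())
```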

And I think the next PR should focus on integrating the AWS CLI into CI so we can benchmark a distributed-systems setup, per @NicolasHug's request

And after that we can see which of the partner integrations should be added to this setup as well

msaroufim commented on Jul 19, 2022

@msaroufim @VitalyFedyunin @NivekT following up on my earlier comments in https://github.com/pytorch/data/issues/416#issuecomment-1164404834 I also have a separate PR (https://github.com/pytorch/vision/pull/6196) that already provides support for the cross-product of:

  • Distributed Learning (DDP) vs 1-GPU training
  • Datapipes (with DataLoader or torchdata.dataloader2) vs Iterable datasets (non-DP) vs MapStyle Datasets
  • Full training procedure or Data-loading only (with or without transforms) or Model training only (generating fake datasets)
  • Timing of data-loading vs model training (see the generic sketch below)
  • any classification model from torchvision

(It also has FFCV support, but that's less relevant for us here).

Since it's directly adapted from torchvision recipes, it's also a bit closer to the kind of training that users would be doing in the wild.
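
For illustration only (this is not code from that PR), a generic way to split data-loading time from model time inside a single training loop, along the lines of the timing bullet above:

```python
import time

import torch


def timed_epoch(model, dataloader, optimizer, device="cuda"):
    # Returns (data_time, model_time) in seconds; an accurate GPU measurement
    # would also call torch.cuda.synchronize() around the model step.
    loss_fn = torch.nn.CrossEntropyLoss()
    data_time, model_time = 0.0, 0.0
    end = time.perf_counter()
    for images, targets in dataloader:
        data_time += time.perf_counter() - end      # time spent waiting on data
        start = time.perf_counter()
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(images), targets)
        loss.backward()
        optimizer.step()
        model_time += time.perf_counter() - start   # time spent in the model step
        end = time.perf_counter()
    return data_time, model_time
```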

Do you think it would make sense to join our benchmarking efforts here? I'm happy to provide support if you'd like to collaborate.

CC @nairbv

NicolasHug commented on Jul 21, 2022

@NicolasHug I am in the process of going through both setups, running them on our AWS cluster, and identifying the differences. I agree that combining the efforts is the right approach. Let me dig a bit deeper first and I can schedule a meeting for all of us to chat.

NivekT commented on Jul 21, 2022

@NicolasHug I think the right way to divide this up would be:

  • I work on the infra setup, the benchmark artifact and the benchmark export
  • I leverage your model training scripts since you're the domain expert

I would also eventually like to do something like pulling any of the HF datasets and benchmarking there, but I don't believe those datasets give me enough information to automatically create a toy model with the right shapes.

But yeah would love to talk

msaroufim commented on Jul 21, 2022