Add script for basic memory pinning CUDA benchmark
Signed-off-by: Aviv Haber [email protected]
Why are these changes needed?
Script for benchmarking memory pinning in PyTorch. The program trains one of two models on CIFAR10 and reports the time taken for data loading and training.
The first model (small) is a simple CNN described in this tutorial; the second model (large) is the torch VGG11 model.
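For reference, a sketch of what the small model presumably looks like (the standard PyTorch CIFAR10 tutorial CNN; the exact definition in torch_test.py may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallNet(nn.Module):
    """Tutorial-style CNN for 32x32 CIFAR10 images (illustrative sketch)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)     # 32x32 -> 28x28
        self.pool = nn.MaxPool2d(2, 2)      # halves spatial dims
        self.conv2 = nn.Conv2d(6, 16, 5)    # 14x14 -> 10x10
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)        # 10 CIFAR10 classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```

The large model can be obtained directly from torchvision as `torchvision.models.vgg11()`.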
Usage
The timer starts after the dataset has been loaded and the model has begun training.

To run the benchmark:

`python torch_test.py --model=small --pin` or `python torch_test.py --model=small`. You can also use `--model=large`.

The `--pin` option determines whether to use memory pinning.

You can select the dataset with `--dataset=imagenet` or `--dataset=cifar10`. The default is CIFAR10. If you use ImageNet you must also pass `--imagenetpath=/path/to/imagenet/root`.

You can use `--size=512` to transform the images to 512x512, for instance. This only applies to the CIFAR dataset; ImageNet images are always transformed to 256x256.
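The core of the benchmark can be sketched roughly as follows. This is an illustrative reconstruction, not the actual contents of torch_test.py: the dataset is synthetic random data standing in for CIFAR10, and the function and parameter names are assumptions. The `--pin` flag maps onto the DataLoader's `pin_memory` argument.

```python
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def run_benchmark(pin: bool, batch_size: int = 4, num_workers: int = 0) -> float:
    # Synthetic stand-in for CIFAR10 so the sketch is self-contained.
    data = TensorDataset(torch.randn(256, 3, 32, 32),
                         torch.randint(0, 10, (256,)))
    loader = DataLoader(data, batch_size=batch_size,
                        num_workers=num_workers, pin_memory=pin)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    start = time.perf_counter()  # timer starts after the data is loaded
    for x, y in loader:
        # non_blocking=True only helps when the source tensor is pinned
        x = x.to(device, non_blocking=pin)
        y = y.to(device, non_blocking=pin)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return time.perf_counter() - start
```

Pinned (page-locked) host memory allows the CUDA driver to perform asynchronous host-to-device copies, which is why the effect only shows up when the transferred tensors are large enough for copy time to matter.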
Results #1
The following are typical results (little variation between runs). Default 32x32 images, CIFAR10.
Code was run on a single g4dn.4xlarge node with BATCH_SIZE=4, NUM_WORKERS=0, EPOCHS=1.
| | Small Model | Large Model |
|---|---|---|
| Pinning off | 41s | 469s |
| Pinning on | 41s | 466s |
There is little to no difference. Maybe this is because the node has enough system memory that it rarely needs to spill to disk when pinning is off. I'm going to run more benchmarks on larger datasets.
Results #2
Using the same CIFAR10 dataset, but with the images transformed to 512x512 (via `--size=512`), we get the following results. BATCH_SIZE=100, NUM_WORKERS=8, EPOCHS=1.
| | Small Model | Large Model |
|---|---|---|
| Pinning off | 354s | |
| Pinning on | 334s | |
Here, with the larger tensor size, pinning actually has an effect (~6% difference).
Results #3
Using a small subset of ImageNet and only doing training (no testing). The images were transformed to 256x256. BATCH_SIZE=100, NUM_WORKERS=8, EPOCHS=5.
| | Small Model | Large Model |
|---|---|---|
| Pinning off | 40s | 217s |
| Pinning on | 35s | 203s |
~6% difference for the large model, ~13% difference for the small model.
Btw, can we target the merge to a non-master branch, and use that branch for all the experiments in this project?
Good findings. Later you can also compare this with AIR's data loading, such as our benchmark: https://sourcegraph.com/github.com/ray-project/ray/-/blob/release/air_tests/air_benchmarks/workloads/pytorch_training_e2e.py?L51
Note that AIR doesn't use the PyTorch DataLoader; it uses `iter_torch_batches` instead, which behaves quite differently from PyTorch's loader.