
Add script for basic memory pinning CUDA benchmark

Open avivhaber opened this issue 2 years ago • 2 comments

Signed-off-by: Aviv Haber [email protected]

Why are these changes needed?

Script for benchmarking memory pinning in PyTorch. The program trains one of two models on CIFAR-10 (or optionally ImageNet) and reports the time spent on data loading and training.

The first model (small) is the simple CNN described in this tutorial. The second model (large) is torchvision's VGG11 model.

Usage

The timer starts after the dataset has been loaded and the model has begun training.

To run the benchmark: python torch_test.py --model=small --pin or python torch_test.py --model=small

You can also use --model=large.

The --pin option determines whether to use memory pinning.

You can select the dataset with --dataset=imagenet or --dataset=cifar10; the default is CIFAR-10. If you use ImageNet, you must also pass --imagenetpath=/path/to/imagenet/root.

You can use --size=512 to transform the images to 512x512, for instance. This only applies to the CIFAR-10 dataset; ImageNet images are always transformed to 256x256.
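For reference, the flags above might be wired together roughly as in the sketch below. This is an illustrative outline rather than the actual contents of torch_test.py: the small model here is a size-agnostic variant of the tutorial CNN (adaptive pooling added so arbitrary --size values work), the ImageNet class count is assumed to be 1000, and the hyperparameter constants are placeholders matching the first results below.

```python
# Illustrative outline only -- not the actual contents of torch_test.py.
import argparse
import time

import torch
import torchvision
import torchvision.transforms as T

parser = argparse.ArgumentParser()
parser.add_argument("--model", choices=["small", "large"], default="small")
parser.add_argument("--pin", action="store_true")
parser.add_argument("--dataset", choices=["cifar10", "imagenet"], default="cifar10")
parser.add_argument("--imagenetpath", type=str, default=None)
parser.add_argument("--size", type=int, default=32)
args = parser.parse_args()

BATCH_SIZE, NUM_WORKERS, EPOCHS = 4, 0, 1  # placeholders (Results #1 values)

# Dataset: CIFAR-10 resized to --size, or ImageNet fixed at 256x256.
if args.dataset == "cifar10":
    transform = T.Compose([T.Resize(args.size), T.ToTensor()])
    train_set = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=transform)
    num_classes = 10
else:
    transform = T.Compose([T.Resize((256, 256)), T.ToTensor()])
    train_set = torchvision.datasets.ImageFolder(args.imagenetpath, transform=transform)
    num_classes = 1000  # assumed full ImageNet label set

# --pin toggles page-locked (pinned) host buffers in the DataLoader.
loader = torch.utils.data.DataLoader(
    train_set, batch_size=BATCH_SIZE, shuffle=True,
    num_workers=NUM_WORKERS, pin_memory=args.pin)

def small_cnn(num_classes):
    # Size-agnostic variant of the tutorial CNN (adaptive pooling added here).
    return torch.nn.Sequential(
        torch.nn.Conv2d(3, 6, 5), torch.nn.ReLU(), torch.nn.MaxPool2d(2),
        torch.nn.Conv2d(6, 16, 5), torch.nn.ReLU(), torch.nn.MaxPool2d(2),
        torch.nn.AdaptiveAvgPool2d(5), torch.nn.Flatten(),
        torch.nn.Linear(16 * 5 * 5, 120), torch.nn.ReLU(),
        torch.nn.Linear(120, 84), torch.nn.ReLU(),
        torch.nn.Linear(84, num_classes))

device = torch.device("cuda")
model = (small_cnn(num_classes) if args.model == "small"
         else torchvision.models.vgg11(num_classes=num_classes)).to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

start = time.time()  # timer starts after the data is ready and training begins
for _ in range(EPOCHS):
    for images, labels in loader:
        # non_blocking copies can only overlap with compute when the batch is pinned
        images = images.to(device, non_blocking=args.pin)
        labels = labels.to(device, non_blocking=args.pin)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
print(f"Training took {time.time() - start:.1f}s")
```

The knob under test is pin_memory=args.pin on the DataLoader, paired with non_blocking transfers in the training loop.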

Results # 1

The following are typical results (little variation between runs), with the default 32x32 images on CIFAR-10. The code was run on a single g4dn.4xlarge node with BATCH_SIZE=4, NUM_WORKERS=0, EPOCHS=1.

|             | Small Model | Large Model |
|-------------|-------------|-------------|
| Pinning off | 41s         | 469s        |
| Pinning on  | 41s         | 466s        |

There is little to no difference. Maybe this is because the node has enough system memory that it rarely needs to spill to disk when pinning is off. I'm going to run more benchmarks on larger datasets.
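One way to isolate the copy path itself, independent of the model and the DataLoader, is a small micro-benchmark along these lines. This is a hypothetical sketch (not part of the submitted script) that times host-to-device transfers from pageable versus pinned host memory:

```python
# Hypothetical micro-benchmark: time host-to-device copies from pageable vs.
# pinned host memory, independent of any model or DataLoader.
import time
import torch

def time_h2d(pin: bool, shape=(100, 3, 512, 512), iters=50) -> float:
    src = torch.randn(*shape)
    if pin:
        src = src.pin_memory()  # page-locked host buffer
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        _ = src.to("cuda", non_blocking=pin)
    torch.cuda.synchronize()  # wait for any asynchronous copies to finish
    return time.time() - start

print("pageable:", time_h2d(False))
print("pinned:  ", time_h2d(True))
```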

Results # 2

Using the same CIFAR-10 dataset, but transforming the images to 512x512 (using --size=512), we get the following results, with BATCH_SIZE=100, NUM_WORKERS=8, EPOCHS=1.

|             | Small Model | Large Model |
|-------------|-------------|-------------|
| Pinning off | 354s        |             |
| Pinning on  | 334s        |             |

Here with the larger tensor size, the pinning actually has an effect (~6% difference).

Results # 3

Using a small subset of ImageNet and doing only training (no testing). The images were transformed to 256x256, with BATCH_SIZE=100, NUM_WORKERS=8, EPOCHS=5.

|             | Small Model | Large Model |
|-------------|-------------|-------------|
| Pinning off | 40s         | 217s        |
| Pinning on  | 35s         | 203s        |

~6% difference for the large model, ~13% difference for the small model.

avivhaber avatar Sep 14 '22 00:09 avivhaber

Btw, can we target the merge to a non-master branch, and use that branch for all the experiments for this project?

clarng avatar Sep 14 '22 17:09 clarng

Good findings. Later you can also compare this with AIR's data loading, such as our benchmark: https://sourcegraph.com/github.com/ray-project/ray/-/blob/release/air_tests/air_benchmarks/workloads/pytorch_training_e2e.py?L51

Note that AIR doesn't use the PyTorch DataLoader and uses iter_torch_batches instead, which will behave quite differently compared to PyTorch.
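For context, the AIR-style iteration looks roughly like the sketch below. This is a rough, hypothetical example (not the code from the linked benchmark); the API names follow Ray 2.x and the toy in-memory dataset just stands in for real image data. The default column name and exact arguments may differ between Ray versions.

```python
# Rough sketch of Ray Data / AIR-style iteration, as opposed to a torch DataLoader.
import numpy as np
import ray
import torch

ray.init()

# Toy in-memory dataset standing in for real image data.
images = np.random.rand(256, 3, 64, 64).astype("float32")
ds = ray.data.from_numpy(images)

# No torch DataLoader: Ray hands back torch batches directly.
for batch in ds.iter_torch_batches(batch_size=100):
    x = batch["data"].to("cuda")  # "data" is the column name from from_numpy
    # ... training step would go here ...
```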

jiaodong avatar Sep 20 '22 22:09 jiaodong