Add script for basic memory pinning CUDA benchmark
Signed-off-by: Aviv Haber [email protected]
Why are these changes needed?
Script for benchmarking memory pinning in PyTorch. The program trains one of two models on CIFAR10 and reports the time taken for data loading and training.
The first model (small) is a simple CNN described in this tutorial; the second model (large) is the torch VGG11 model.
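For reference, a sketch of what the small model presumably looks like (the standard PyTorch CIFAR10 tutorial CNN; the exact definition in torch_test.py may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallNet(nn.Module):
    """Tutorial-style CNN for 32x32 CIFAR10 images (illustrative sketch)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)     # 32x32 -> 28x28
        self.pool = nn.MaxPool2d(2, 2)      # halves spatial dims
        self.conv2 = nn.Conv2d(6, 16, 5)    # 14x14 -> 10x10
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)        # 10 CIFAR10 classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```

The large model can be obtained directly from torchvision as `torchvision.models.vgg11()`.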
Usage
The timer starts after the dataset has been loaded and the model has begun training.

To run the benchmark:

`python torch_test.py --model=small --pin` or `python torch_test.py --model=small`. You can also use `--model=large`.

The `--pin` option determines whether to use memory pinning.

You can select the dataset with `--dataset=imagenet` or `--dataset=cifar10`. The default is CIFAR10. If you use ImageNet you must also pass `--imagenetpath=/path/to/imagenet/root`.

You can use `--size=512` to transform the images to 512x512, for instance. This only applies to the CIFAR dataset; ImageNet images are always transformed to 256x256.
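The core of the benchmark can be sketched roughly as follows. This is an illustrative reconstruction, not the actual contents of torch_test.py: the dataset is synthetic random data standing in for CIFAR10, and the function and parameter names are assumptions. The `--pin` flag maps onto the DataLoader's `pin_memory` argument.

```python
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def run_benchmark(pin: bool, batch_size: int = 4, num_workers: int = 0) -> float:
    # Synthetic stand-in for CIFAR10 so the sketch is self-contained.
    data = TensorDataset(torch.randn(256, 3, 32, 32),
                         torch.randint(0, 10, (256,)))
    loader = DataLoader(data, batch_size=batch_size,
                        num_workers=num_workers, pin_memory=pin)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    start = time.perf_counter()  # timer starts after the data is loaded
    for x, y in loader:
        # non_blocking=True only helps when the source tensor is pinned
        x = x.to(device, non_blocking=pin)
        y = y.to(device, non_blocking=pin)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return time.perf_counter() - start
```

Pinned (page-locked) host memory allows the CUDA driver to perform asynchronous host-to-device copies, which is why the effect only shows up when the transferred tensors are large enough for copy time to matter.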
Results #1
The following are typical results (little variation between runs). Default 32x32 images, CIFAR10.
Code was run on a single g4dn.4xlarge node with BATCH_SIZE=4, NUM_WORKERS=0, EPOCHS=1.
| | Small Model | Large Model |
|---|---|---|
| Pinning off | 41s | 469s |
| Pinning on | 41s | 466s |
There is little to no difference. Maybe this is because the node has enough system memory that it rarely needs to spill to disk when pinning is off. I'm going to run more benchmarks on larger datasets.
Results #2
Using the same CIFAR10 dataset, but with the images transformed to 512x512 (via `--size=512`), we get the following results. BATCH_SIZE=100, NUM_WORKERS=8, EPOCHS=1.
| | Small Model | Large Model |
|---|---|---|
| Pinning off | 354s | |
| Pinning on | 334s | |
Here, with the larger tensor size, pinning actually has an effect (~6% difference).
Results #3
Using a small subset of ImageNet and only doing training (no testing). The images were transformed to 256x256. BATCH_SIZE=100, NUM_WORKERS=8, EPOCHS=5.
| | Small Model | Large Model |
|---|---|---|
| Pinning off | 40s | 217s |
| Pinning on | 35s | 203s |
~6% difference for the large model, ~13% difference for the small model.
Btw, can we target the merge to a non-master branch, and use that branch for all the experiments in this project?
Good findings. Later you can also compare this with AIR's data loading, such as our benchmark: https://sourcegraph.com/github.com/ray-project/ray/-/blob/release/air_tests/air_benchmarks/workloads/pytorch_training_e2e.py?L51
Note that AIR doesn't use the PyTorch DataLoader; it uses `iter_torch_batches` instead, which behaves quite differently from PyTorch's loader.