[FEATURE] Run benchmark `--model-list` in subprocess
Is your feature request related to a problem? Please describe.
While benchmarking timm models with benchmark.py, I tried the following two commands:
python benchmark.py --model-list _models.txt -b 128
python benchmark.py --model xception -b 128
where _models.txt is a text file containing 65 different models, with xception on the last line.
Due to possible PyTorch memory allocation fragmentation, by the time the first command reaches the xception model, the available GPU memory can be much smaller than it is for the second command. This may cause the cuDNN convolution heuristics to choose a more conservative but slower algorithm.
(I've seen plenty of PyTorch CUDA OOM warnings using command 1 on an A100 40GB GPU. 😨 )
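For reference, a minimal sketch of a diagnostic helper (my own, not part of benchmark.py) that could be called between models to see how much memory the caching allocator is still holding; a large gap between reserved and allocated memory is one sign of fragmentation:

import torch

def report_cuda_memory(tag):
    # memory_allocated(): bytes currently occupied by live tensors
    # memory_reserved(): bytes held by the caching allocator, including free
    # but fragmented blocks that have not been returned to the driver
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f'{tag}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB')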
As a result, for the xception model on an A100, with a PyTorch source build (https://github.com/pytorch/pytorch/commit/cd51d2a3ecc8ac579bee910f6bafe41a4c41ca80) and CUDA 11.5 + the cuDNN version following 8.2.4, the first command gave:
Benchmarking in float32 precision. NCHW layout. torchscript disabled
Model xception created, param count: 22855952
Running inference benchmark on xception for 40 steps w/ input size (3, 299, 299) and batch size 128.
Infer [8/40]. 754.69 samples/sec. 169.605 ms/step.
Infer [16/40]. 754.71 samples/sec. 169.601 ms/step.
Infer [24/40]. 754.74 samples/sec. 169.596 ms/step.
Infer [32/40]. 754.74 samples/sec. 169.595 ms/step.
Infer [40/40]. 754.74 samples/sec. 169.595 ms/step.
Inference benchmark of xception done. 754.65 samples/sec, 169.59 ms/step
Model xception created, param count: 22855952
Running train benchmark on xception for 40 steps w/ input size (3, 299, 299) and batch size 128.
Train [8/40]. 267.01 samples/sec. 479.391 ms/step.
Train [16/40]. 266.97 samples/sec. 479.463 ms/step.
Train [24/40]. 266.96 samples/sec. 479.476 ms/step.
Train [32/40]. 266.96 samples/sec. 479.480 ms/step.
Train [40/40]. 266.95 samples/sec. 479.488 ms/step.
Train benchmark of xception done. 266.37 samples/sec, 479.49 ms/sample
The second command gave:
Benchmarking in float32 precision. NCHW layout. torchscript disabled
Model xception created, param count: 22855952
Running inference benchmark on xception for 40 steps w/ input size (3, 299, 299) and batch size 128.
Infer [8/40]. 1219.51 samples/sec. 104.960 ms/step.
Infer [16/40]. 1219.10 samples/sec. 104.996 ms/step.
Infer [24/40]. 1219.20 samples/sec. 104.987 ms/step.
Infer [32/40]. 1218.06 samples/sec. 105.085 ms/step.
Infer [40/40]. 1218.27 samples/sec. 105.067 ms/step.
Inference benchmark of xception done. 1217.86 samples/sec, 105.07 ms/step
Model xception created, param count: 22855952
Running train benchmark on xception for 40 steps w/ input size (3, 299, 299) and batch size 128.
Train [8/40]. 308.69 samples/sec. 414.653 ms/step.
Train [16/40]. 308.69 samples/sec. 414.651 ms/step.
Train [24/40]. 308.70 samples/sec. 414.648 ms/step.
Train [32/40]. 308.69 samples/sec. 414.654 ms/step.
Train [40/40]. 308.69 samples/sec. 414.655 ms/step.
Train benchmark of xception done. 307.78 samples/sec, 414.65 ms/sample
As shown above, the first command gives noticeably lower throughput than the second command.
Describe the solution you'd like
Run each model from benchmark.py --model-list in its own subprocess. Even a bash script with a for-loop running python benchmark.py --model <model> -b 128 per model would already be good.
A subprocess guarantees that GPU memory is fully released between models. Although spawning a process per model is slower and adds some overhead, a few seconds of overhead per model is not a big problem because it yields more accurate benchmark results.
The better way would be to do this natively in the benchmark.py script. Also, running --model with a single model does not create a benchmark.csv file summarizing the results, so with the bash for-loop approach I would need to collect the results manually (see the sketch below for one way to do that).
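For illustration, here is a rough standalone Python sketch of the per-model subprocess workaround (not the proposed benchmark.py change). The output file name and the stdout parsing are my assumptions; the regex is based on the summary lines shown in the logs above:

import csv
import re
import subprocess
import sys

model_list = sys.argv[1] if len(sys.argv) > 1 else '_models.txt'
batch_size = '128'
out_csv = 'benchmark_subprocess.csv'

# Matches summary lines such as:
#   "Inference benchmark of xception done. 1217.86 samples/sec, 105.07 ms/step"
summary_re = re.compile(
    r'(\w+) benchmark of (\S+) done\. ([\d.]+) samples/sec, ([\d.]+) ms/\w+')

with open(model_list) as f:
    models = [line.strip() for line in f if line.strip()]

rows = []
for model in models:
    # Each invocation runs in a fresh process with a fresh CUDA context, so no
    # allocator fragmentation carries over from previously benchmarked models.
    proc = subprocess.run(
        [sys.executable, 'benchmark.py', '--model', model, '-b', batch_size],
        capture_output=True, text=True)
    print(proc.stdout, end='')
    for bench, name, samples_sec, ms_step in summary_re.findall(proc.stdout):
        rows.append({'model': name, 'bench': bench.lower(),
                     'samples_per_sec': samples_sec, 'ms_per_step': ms_step})

with open(out_csv, 'w', newline='') as f:
    writer = csv.DictWriter(
        f, fieldnames=['model', 'bench', 'samples_per_sec', 'ms_per_step'])
    writer.writeheader()
    writer.writerows(rows)
print(f'Wrote {len(rows)} rows to {out_csv}')

This could be run as e.g. python run_each_model.py _models.txt (the script name is arbitrary); doing the same thing natively inside benchmark.py would still be preferable.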
Describe alternatives you've considered
I tried adding the following code here, but it does not help:
gc.collect()
torch.cuda.empty_cache()
https://github.com/rwightman/pytorch-image-models/blob/aaff2d82d06109703a06aca2c0c20815b9d46cbb/benchmark.py#L493-L494
Additional context
This issue started to appear around 10/22/21. It might be due to changes in PyTorch or the CUDA libraries. Since using a subprocess resolves the issue, it would be beneficial to have that.
timm is git clone'd from GitHub at commit a41de1f666f9187e70845bbcf5b092f40acaf097.
cc @ptrblck
@xwang233 it's definitely related to specific PyTorch and CUDA/cuDNN releases; I have to stick to ones that work to get through without issues. Recently I found that the conda PyTorch 1.10 release was pretty decent for a CUDA 11.x release; the CUDA 10.x builds were generally quite a bit more reliable for this sort of use case (I do the same thing batch-validating a large number of models for my results csv files, and could only make it through on the CUDA 10.x wheels of PyTorch 1.9 and 1.8).
I do agree though, for reliability I should run each model in a separate process so I don't screw up the CUDA context. I actually need to do this for PyTorch XLA (TPU) benchmarking soon, so I do plan to come up with a way of doing this for validate.py and benchmark.py.
EDIT: I should also add that running both inference and train through lots of models without hitting slowdowns or unrecoverable errors is quite a bit more challenging...
bulk_runner.py does this now; I've been using it for mass benchmark and validation runs for a while.