
No significant change in iters/sec while comparing cpu vs gpu performance

hemantranvir opened this issue on Nov 01 '19 · 5 comments

I have installed torch_tvm with CUDA/OpenCL support by enabling the following options in TVM's config.cmake: https://github.com/dmlc/tvm/blob/master/cmake/config.cmake#L32 https://github.com/dmlc/tvm/blob/master/cmake/config.cmake#L129 https://github.com/dmlc/tvm/blob/master/cmake/config.cmake#L132
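Concretely, the flags mentioned above and in the note at the bottom of this issue amount to something like the following in config.cmake (a sketch; exact line numbers drift between TVM versions, and USE_OPENCL is inferred from the "cuda/opencl support" wording):

```cmake
# In tvm/cmake/config.cmake (typically copied into the build directory):
set(USE_CUDA ON)    # build the CUDA runtime and codegen
set(USE_OPENCL ON)  # build the OpenCL runtime
set(USE_CUDNN ON)   # use cuDNN kernels where available
set(USE_CUBLAS ON)  # use cuBLAS for dense ops
```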

I am trying to compare CPU vs GPU performance by running the following benchmark: https://github.com/pytorch/tvm/blob/master/test/benchmarks.py
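For context, the measurement procedure visible in the logs below (warm up for 10 iterations, then time 10 iterations and report iter/s) can be sketched with the standard library only; `fn` is a stand-in for the traced model call, not the actual benchmarks.py code:

```python
import time

def benchmark(fn, warmup=10, iters=10):
    """Warm up, then time `iters` calls and return iterations per second."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    return iters / elapsed

# Example with a trivial stand-in workload.
rate = benchmark(lambda: sum(range(10000)))
print(f"{rate:.2f} iter/s")
```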

  • CPU version:
$ CUDA_VISIBLE_DEVICES='' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 

Execution Log:

root@ccf26f0f9541:/opt/work/tvm/test# CUDA_VISIBLE_DEVICES='' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 
Tracing model with JIT
Warming JIT up with 10 runs
Running JIT 10 times
Done benchmarking JIT
Tracing model with TVM
WARNING: reshape with -1 as the first value has known incompatibility with PyTorch semantics.
Cannot find config for target=llvm -mcpu=core-avx2, workload=('dense', (1, 512, 'float32'), (125, 512, 8, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
[08:58:08] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:08] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:08] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/112))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (2 - (ax0.ax1.outer.fused.ax2.outer.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.outer.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.outer.fused/4))) + 1) >= 0), when generating the post doubt loop
[08:58:09] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((1 - (7 - ((ax0.ax1.outer.fused.ax2.outer.fused % 4)*2))) + 1) >= 0), when generating the post doubt loop
/usr/local/lib/python3.5/dist-packages/torch/jit/__init__.py:1030: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Not within tolerance rtol=1e-05 atol=1e-05 at input[0, 255] (-0.710386335849762 vs. -0.7103500366210938) and 5 other locations (0.00%)
  check_tolerance, _force_outplace, True, _module_class)
Warming TVM up with 10 iters
Running TVM 10 times
Done benchmarking TVM, which compiled 100.00% of compute
JIT: 39.134256974191366 iter/s
TVM: 62.80919757107452 iter/s
root@ccf26f0f9541:/opt/work/tvm/test# 
  • GPU version: edit line 39 of benchmarks.py to torch_tvm.enable(opt_level=3, device_type='cuda')
$ CUDA_VISIBLE_DEVICES='0' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 

Execution Log:

root@ccf26f0f9541:/opt/work/tvm/test# CUDA_VISIBLE_DEVICES='0' PYTHONPATH=../:$PYTHONPATH python3 benchmarks.py 
Tracing model with JIT
Warming JIT up with 10 runs
Running JIT 10 times
Done benchmarking JIT
Tracing model with TVM
WARNING: reshape with -1 as the first value has known incompatibility with PyTorch semantics.
Cannot find config for target=llvm -mcpu=core-avx2, workload=('dense', (1, 512, 'float32'), (125, 512, 8, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
[08:58:43] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:43] /opt/work/tvm/tvm/src/pass/vectorize_loop.cc:389: Detect vector condition in Vectorized Loop, scalarizing...
[08:58:43] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:43] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (64 - (ax0.ax1.outer.fused.ax2.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.fused/14))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (4 - (ax0.ax1.outer.fused.ax2.fused/56))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (8 - (ax0.ax1.outer.fused.ax2.fused/112))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (2 - (ax0.ax1.outer.fused.ax2.outer.fused/28))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (16 - (ax0.ax1.outer.fused.ax2.outer.fused/7))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((0 - (32 - (ax0.ax1.outer.fused.ax2.outer.fused/4))) + 1) >= 0), when generating the post doubt loop
[08:58:44] /opt/work/tvm/tvm/src/pass/loop_partition.cc:550: Cannot prove: (((1 - (7 - ((ax0.ax1.outer.fused.ax2.outer.fused % 4)*2))) + 1) >= 0), when generating the post doubt loop
/usr/local/lib/python3.5/dist-packages/torch/jit/__init__.py:1030: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Not within tolerance rtol=1e-05 atol=1e-05 at input[0, 255] (-0.710386335849762 vs. -0.7103500366210938) and 5 other locations (0.00%)
  check_tolerance, _force_outplace, True, _module_class)
Warming TVM up with 10 iters
Running TVM 10 times
Done benchmarking TVM, which compiled 100.00% of compute
JIT: 39.478923510188096 iter/s
TVM: 64.52328684937197 iter/s
root@ccf26f0f9541:/opt/work/tvm/test# 

As seen above, there is no significant difference in iter/s between the two runs: CPU version 62.80919757107452 iter/s vs. GPU version 64.52328684937197 iter/s.
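A quick sanity check on the two TVM numbers reported above: the gap is under 3%, which looks like ordinary run-to-run noise rather than a CPU-to-GPU speedup:

```python
# Reported iterations/sec from the two runs above.
cpu_tvm = 62.80919757107452
gpu_tvm = 64.52328684937197

# If the 'cuda' run were actually executing on the GPU, one would expect
# a much larger gap; ~3% is within measurement noise.
speedup = gpu_tvm / cpu_tvm
print(f"apparent speedup: {speedup:.3f}x")
```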

If I check GPU memory usage with the nvidia-smi command, the GPU is, as the numbers suggest, idle. Is any other configuration necessary to enable the GPU backend?

(Apart from setting set(USE_CUDA ON), set(USE_CUDNN ON), and set(USE_CUBLAS ON) in https://github.com/dmlc/tvm/blob/master/cmake/config.cmake, and setting torch_tvm.enable(opt_level=3, device_type='cuda') in https://github.com/pytorch/tvm/blob/master/test/benchmarks.py)

hemantranvir avatar Nov 01 '19 09:11 hemantranvir

I don't think the current integration supports CUDA yet, but we have something WIP. @ilia-cher

yinghai avatar Nov 01 '19 21:11 yinghai

I have a local patch that adds support for CUDA; ETA is to send it next week.

ilia-cher avatar Nov 01 '19 22:11 ilia-cher

@ilia-cher Thanks for your response! If it is not a big modification to the source code, considering the TVM side already supports CUDA, could you please describe how to enable CUDA support in torch_tvm? Excuse my hastiness; I am in a bit of a hurry.

hemantranvir avatar Nov 04 '19 06:11 hemantranvir

I plan to send the CUDA support PR this week.

ilia-cher avatar Nov 04 '19 23:11 ilia-cher

@ilia-cher Any updates? Thanks!

hemantranvir avatar Nov 11 '19 03:11 hemantranvir