ROMP
Low performance despite low hardware utilization
Problem
Simple ROMP performs very poorly on my machine:
- around 10 FPS (standalone: romp --mode=webcam --show -t)
- around 7 FPS (as a module: from romp import ROMP)
But the utilization of my CUDA GPU is still low:

Steps to reproduce
conda create -n romp python=3.10
conda activate romp
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
pip install simple-romp cython
romp --mode=webcam --show -t
Fixed the 7 FPS issue:
I used ROMP in combination with my vmcp package and its vmcp.osc.backend.osc4py3.as_comthreads OSC backend. Since that backend also uses threading, it caused an additional performance loss. I fixed that by switching to the vmcp.osc.backend.osc4py3.as_eventloop backend and calling vmcp.osc.channel.Sender.system.run() after every vmcp.osc.channel.Sender.send().
But the inefficient hardware usage still limits me to around 10 FPS.
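The event-loop fix above follows a general pattern: instead of a background thread draining the send queue (and competing with the inference loop for the GIL), the main loop pumps the queue itself right after each send. A minimal stdlib sketch of that pattern, with made-up names; this is NOT the real vmcp or osc4py3 API:

```python
import queue

# Sketch of the pattern only (hypothetical class, not vmcp's Sender).
class EventLoopSender:
    def __init__(self):
        self._pending = queue.Queue()

    def send(self, message):
        # Enqueue only; nothing is transmitted yet.
        self._pending.put(message)

    def run(self):
        # Pump the loop: flush everything queued so far, then return.
        flushed = []
        while not self._pending.empty():
            flushed.append(self._pending.get())
        return flushed

sender = EventLoopSender()
for frame in range(3):
    sender.send(f"pose-{frame}")
    sender.run()  # pump after every send, as with the as_eventloop backend
```

Because the pump runs synchronously in the main loop, no extra thread ever contends with the inference code, which is why this variant avoided the threading-related FPS loss.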
I ran the torch.utils.bottleneck profiler over my script, with 10 predictions per test:
--------------------------------------------------------------------------------
cProfile output
--------------------------------------------------------------------------------
3696464 function calls (3485118 primitive calls) in 7.879 seconds
Ordered by: internal time
List reduced from 3133 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
3100 2.324 0.001 2.324 0.001 {built-in method torch.conv2d}
1 0.954 0.954 7.881 7.881 romp_vmcp.py:1(<module>)
10 0.806 0.081 3.650 0.365 D:\miniconda3\envs\romp\lib\site-packages\romp\model.py:382(forward)
1097 0.364 0.000 0.727 0.001 D:\miniconda3\envs\romp\lib\site-packages\torch\nn\modules\module.py:1440(_load_from_state_dict)
10 0.351 0.035 0.351 0.035 {method 'read' of 'cv2.VideoCapture' objects}
2048976 0.272 0.000 0.272 0.000 {method 'startswith' of 'str' objects}
3070 0.208 0.000 0.208 0.000 {built-in method torch.batch_norm}
316 0.126 0.000 0.126 0.000 {method 'uniform_' of 'torch._C._TensorBase' objects}
2760 0.107 0.000 0.107 0.000 {built-in method torch.relu_}
1853 0.095 0.000 0.095 0.000 {method 'copy_' of 'torch._C.StorageBase' objects}
3772 0.092 0.000 0.092 0.000 {method 'to' of 'torch._C._TensorBase' objects}
1853 0.090 0.000 0.203 0.000 D:\miniconda3\envs\romp\lib\site-packages\torch\_utils.py:48(_cuda)
30 0.080 0.003 0.114 0.004 D:\miniconda3\envs\romp\lib\site-packages\romp\utils.py:606(rotation_matrix_to_quaternion)
1851 0.079 0.000 0.079 0.000 {method 'copy_' of 'torch._C._TensorBase' objects}
184480/21960 0.075 0.000 0.079 0.000 D:\miniconda3\envs\romp\lib\site-packages\torch\nn\modules\module.py:1775(named_modules)
--------------------------------------------------------------------------------
autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
------------------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------ ------------ ------------ ------------ ------------ ------------ ------------
DataParallel.forward 10.19% 57.587ms 21.09% 119.144ms 119.144ms 1
DataParallel.forward 10.10% 57.059ms 20.87% 117.934ms 117.934ms 1
DataParallel.forward 10.00% 56.507ms 20.55% 116.128ms 116.128ms 1
DataParallel.forward 9.82% 55.509ms 20.41% 115.327ms 115.327ms 1
DataParallel.forward 9.80% 55.348ms 20.36% 115.011ms 115.011ms 1
DataParallel.forward 9.58% 54.141ms 20.33% 114.857ms 114.857ms 1
DataParallel.forward 9.59% 54.172ms 20.10% 113.549ms 113.549ms 1
DataParallel.forward 9.70% 54.788ms 20.04% 113.232ms 113.232ms 1
DataParallel.forward 9.52% 53.783ms 20.04% 113.219ms 113.219ms 1
DataParallel.forward 9.54% 53.874ms 19.99% 112.923ms 112.923ms 1
aten::uniform_ 0.44% 2.462ms 0.44% 2.462ms 2.462ms 1
aten::uniform_ 0.44% 2.459ms 0.44% 2.459ms 2.459ms 1
aten::uniform_ 0.43% 2.441ms 0.43% 2.441ms 2.441ms 1
aten::uniform_ 0.43% 2.437ms 0.43% 2.437ms 2.437ms 1
aten::uniform_ 0.43% 2.417ms 0.43% 2.417ms 2.417ms 1
------------------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 564.984ms
--------------------------------------------------------------------------------
autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.
------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
DataParallel.forward 10.50% 68.836ms 22.29% 146.068ms 146.068ms 2.030ms 9.64% 137.153ms 137.153ms 1
DataParallel.forward 10.39% 68.062ms 22.10% 144.827ms 144.827ms 2.673ms 12.69% 132.493ms 132.493ms 1
DataParallel.forward 10.12% 66.321ms 21.33% 139.765ms 139.765ms 2.053ms 9.74% 129.801ms 129.801ms 1
DataParallel.forward 9.93% 65.093ms 21.14% 138.530ms 138.530ms 2.056ms 9.76% 129.635ms 129.635ms 1
DataParallel.forward 9.87% 64.702ms 21.06% 138.023ms 138.023ms 2.035ms 9.66% 128.844ms 128.844ms 1
DataParallel.forward 9.78% 64.101ms 21.05% 137.950ms 137.950ms 2.035ms 9.66% 129.586ms 129.586ms 1
DataParallel.forward 9.53% 62.471ms 20.71% 135.708ms 135.708ms 2.016ms 9.57% 127.760ms 127.760ms 1
DataParallel.forward 9.77% 64.044ms 20.70% 135.675ms 135.675ms 2.064ms 9.80% 125.398ms 125.398ms 1
DataParallel.forward 9.43% 61.809ms 20.21% 132.436ms 132.436ms 2.032ms 9.64% 124.054ms 124.054ms 1
DataParallel.forward 9.51% 62.324ms 20.20% 132.358ms 132.358ms 2.064ms 9.80% 122.687ms 122.687ms 1
aten::uniform_ 0.39% 2.566ms 0.39% 2.566ms 2.566ms 1.000us 0.00% 1.000us 1.000us 1
aten::uniform_ 0.39% 2.534ms 0.39% 2.534ms 2.534ms 1.000us 0.00% 1.000us 1.000us 1
aten::uniform_ 0.38% 2.485ms 0.38% 2.485ms 2.485ms 1.000us 0.00% 1.000us 1.000us 1
aten::to 0.00% 7.000us 0.37% 2.423ms 2.423ms 3.000us 0.01% 3.809ms 3.809ms 1
aten::_to_copy 0.00% 29.000us 0.37% 2.416ms 2.416ms 5.000us 0.02% 3.806ms 3.806ms 1
------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 655.384ms
Self CUDA time total: 21.069ms
I'm not exactly sure what this means, but to me it looks as if the ROMP implementation is slowed down by too much communication overhead.
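One way to read the CUDA-mode report is to compare the two totals it prints at the end. A quick back-of-the-envelope check, using the numbers from the report above:

```python
# Totals printed at the end of the CUDA-mode autograd profiler report above.
self_cpu_ms = 655.384   # "Self CPU time total"
self_cuda_ms = 21.069   # "Self CUDA time total"

# Fraction of the profiled self time actually spent inside GPU kernels.
gpu_fraction = self_cuda_ms / self_cpu_ms
print(f"{gpu_fraction:.1%}")  # -> 3.2%
```

With only about 3% of the self time spent in CUDA kernels, the GPU sits idle most of the time, which matches the low-utilization observation: per-frame Python and launch overhead, not kernel time, appears to dominate.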
Configuration
GPU: 0
onnx: false
smooth_coeff: 1
temporal_optimize: true
@Arthur151 Do you have any ideas for optimization of ROMP to speed it up? 🙂
By the way, I need to mention that I was not able to test ONNX with CUDA because of https://github.com/Arthur151/ROMP/issues/336
@vivi90 Hi Vivian, yes, I see your question. It happened to my colleague too. ROMP runs at over 25/50 FPS on my 1070Ti/3090Ti, but only at about 20 FPS on my colleague's 3090 server. I haven't found out what causes this problem, but I guess the reason might be some essential acceleration libraries that I installed and my colleague didn't. I haven't determined which library it is. Sorry to say that.
@Arthur151 Hey, did you manage to fix the bug? Whether it is ROMP or BEV, I can only get about 20 FPS on my 4080.
@JunfengLiu1
Please share the following information with us:
- Used operating system
- Used python version
- Used CUDA version
- ROMP & BEV configuration settings
- Profiling reports (https://pytorch.org/docs/stable/bottleneck.html)
@vivi90
- Used operating system: Ubuntu 18.04
- Used python version: 3.7
- Used CUDA version: 11.4
- ROMP & BEV configuration settings:
  I did not modify any parameters in the main file.
  romp: romp --mode=webcam --show, this is the result:
  bev: bev --mode=webcam --show, result:
  The problem is that every now and then (roughly 20% of the time) it drops to 10 FPS (romp as well).
- Profiling reports:
  This is the result of bev --mode=webcam --show; it stopped running after detecting 300 frames.

- Environment:
certifi 2022.12.7
commonmark 0.9.1
cycler 0.11.0
Cython 0.29.33
cython-bbox 0.1.3
filterpy 1.4.5
fonttools 4.38.0
importlib-metadata 4.8.3
kiwisolver 1.4.4
lap 0.4.0
matplotlib 3.5.3
norfair 2.2.0
numpy 1.21.6
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
opencv-python 4.7.0.72
packaging 23.0
Pillow 9.4.0
pip 22.3.1
Pygments 2.14.0
pyparsing 3.0.9
PySocks 1.7.1
python-dateutil 2.8.2
rich 12.6.0
scipy 1.7.3
setuptools 67.6.0
simple-romp 1.0.8
six 1.16.0
torch 1.13.1
typing_extensions 4.5.0
wget 3.2
wheel 0.38.4
zipp 3.15.0
I installed it using pip install --upgrade simple_romp, which by default installed nvidia-cuda-nvrtc-cu11 11.7.99, while my local CUDA is 11.4. Could that be the problem?
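To see which CUDA-related wheels pip actually resolved (as opposed to the system toolkit version), the stdlib importlib.metadata can list the installed versions; the package names below are the ones from the environment list above, adjust as needed:

```python
from importlib.metadata import version, PackageNotFoundError

# Package names taken from the environment list above.
cuda_related = [
    "torch",
    "nvidia-cuda-runtime-cu11",
    "nvidia-cuda-nvrtc-cu11",
    "nvidia-cudnn-cu11",
]

for name in cuda_related:
    try:
        print(f"{name}: {version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")
```

As far as I know, pip-installed torch wheels bundle their own CUDA runtime, so the locally installed toolkit version usually matters less than whether the NVIDIA driver is new enough for the bundled runtime.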
I changed my CUDA to 11.8, but BEV still runs at about 15-20 FPS, and it still keeps dropping to 10 FPS.