ROMP
Low performance despite low hardware utilization
Problem
Simple ROMP performs very poorly on my machine:
- around 10 FPS (standalone: romp --mode=webcam --show -t)
- around 7 FPS (as a module: from romp import ROMP)
But the utilization of my CUDA GPU is still low:

Steps to reproduce
conda create -n romp python=3.10
conda activate romp
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
pip install simple-romp cython
romp --mode=webcam --show -t
Fixed the 7 FPS issue:
I used ROMP in combination with my vmcp package and its vmcp.osc.backend.osc4py3.as_comthreads OSC backend. Since that backend also uses threading, it caused an additional performance loss. I fixed that by switching to the vmcp.osc.backend.osc4py3.as_eventloop backend and calling vmcp.osc.channel.Sender.system.run() after every vmcp.osc.channel.Sender.send().
But the inefficient hardware usage still limits me to around 10 FPS.
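The event-loop fix above follows a general pattern: instead of a background thread draining the send queue (and competing with the inference loop for the GIL), the main loop pumps the queue itself right after each send. A minimal stdlib sketch of that pattern, with made-up names; this is NOT the real vmcp or osc4py3 API:

```python
import queue

# Sketch of the pattern only (hypothetical class, not vmcp's Sender).
class EventLoopSender:
    def __init__(self):
        self._pending = queue.Queue()

    def send(self, message):
        # Enqueue only; nothing is transmitted yet.
        self._pending.put(message)

    def run(self):
        # Pump the loop: flush everything queued so far, then return.
        flushed = []
        while not self._pending.empty():
            flushed.append(self._pending.get())
        return flushed

sender = EventLoopSender()
for frame in range(3):
    sender.send(f"pose-{frame}")
    sender.run()  # pump after every send, as with the as_eventloop backend
```

Because the pump runs synchronously in the main loop, no extra thread ever contends with the inference code, which is why this variant avoided the threading-related FPS loss.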
I ran the torch.utils.bottleneck profiler over my script, with 10 predictions per test:
--------------------------------------------------------------------------------
cProfile output
--------------------------------------------------------------------------------
3696464 function calls (3485118 primitive calls) in 7.879 seconds
Ordered by: internal time
List reduced from 3133 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
3100 2.324 0.001 2.324 0.001 {built-in method torch.conv2d}
1 0.954 0.954 7.881 7.881 romp_vmcp.py:1(<module>)
10 0.806 0.081 3.650 0.365 D:\miniconda3\envs\romp\lib\site-packages\romp\model.py:382(forward)
1097 0.364 0.000 0.727 0.001 D:\miniconda3\envs\romp\lib\site-packages\torch\nn\modules\module.py:1440(_load_from_state_dict)
10 0.351 0.035 0.351 0.035 {method 'read' of 'cv2.VideoCapture' objects}
2048976 0.272 0.000 0.272 0.000 {method 'startswith' of 'str' objects}
3070 0.208 0.000 0.208 0.000 {built-in method torch.batch_norm}
316 0.126 0.000 0.126 0.000 {method 'uniform_' of 'torch._C._TensorBase' objects}
2760 0.107 0.000 0.107 0.000 {built-in method torch.relu_}
1853 0.095 0.000 0.095 0.000 {method 'copy_' of 'torch._C.StorageBase' objects}
3772 0.092 0.000 0.092 0.000 {method 'to' of 'torch._C._TensorBase' objects}
1853 0.090 0.000 0.203 0.000 D:\miniconda3\envs\romp\lib\site-packages\torch\_utils.py:48(_cuda)
30 0.080 0.003 0.114 0.004 D:\miniconda3\envs\romp\lib\site-packages\romp\utils.py:606(rotation_matrix_to_quaternion)
1851 0.079 0.000 0.079 0.000 {method 'copy_' of 'torch._C._TensorBase' objects}
184480/21960 0.075 0.000 0.079 0.000 D:\miniconda3\envs\romp\lib\site-packages\torch\nn\modules\module.py:1775(named_modules)
--------------------------------------------------------------------------------
autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
------------------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------ ------------ ------------ ------------ ------------ ------------ ------------
DataParallel.forward 10.19% 57.587ms 21.09% 119.144ms 119.144ms 1
DataParallel.forward 10.10% 57.059ms 20.87% 117.934ms 117.934ms 1
DataParallel.forward 10.00% 56.507ms 20.55% 116.128ms 116.128ms 1
DataParallel.forward 9.82% 55.509ms 20.41% 115.327ms 115.327ms 1
DataParallel.forward 9.80% 55.348ms 20.36% 115.011ms 115.011ms 1
DataParallel.forward 9.58% 54.141ms 20.33% 114.857ms 114.857ms 1
DataParallel.forward 9.59% 54.172ms 20.10% 113.549ms 113.549ms 1
DataParallel.forward 9.70% 54.788ms 20.04% 113.232ms 113.232ms 1
DataParallel.forward 9.52% 53.783ms 20.04% 113.219ms 113.219ms 1
DataParallel.forward 9.54% 53.874ms 19.99% 112.923ms 112.923ms 1
aten::uniform_ 0.44% 2.462ms 0.44% 2.462ms 2.462ms 1
aten::uniform_ 0.44% 2.459ms 0.44% 2.459ms 2.459ms 1
aten::uniform_ 0.43% 2.441ms 0.43% 2.441ms 2.441ms 1
aten::uniform_ 0.43% 2.437ms 0.43% 2.437ms 2.437ms 1
aten::uniform_ 0.43% 2.417ms 0.43% 2.417ms 2.417ms 1
------------------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 564.984ms
--------------------------------------------------------------------------------
autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.
------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
DataParallel.forward 10.50% 68.836ms 22.29% 146.068ms 146.068ms 2.030ms 9.64% 137.153ms 137.153ms 1
DataParallel.forward 10.39% 68.062ms 22.10% 144.827ms 144.827ms 2.673ms 12.69% 132.493ms 132.493ms 1
DataParallel.forward 10.12% 66.321ms 21.33% 139.765ms 139.765ms 2.053ms 9.74% 129.801ms 129.801ms 1
DataParallel.forward 9.93% 65.093ms 21.14% 138.530ms 138.530ms 2.056ms 9.76% 129.635ms 129.635ms 1
DataParallel.forward 9.87% 64.702ms 21.06% 138.023ms 138.023ms 2.035ms 9.66% 128.844ms 128.844ms 1
DataParallel.forward 9.78% 64.101ms 21.05% 137.950ms 137.950ms 2.035ms 9.66% 129.586ms 129.586ms 1
DataParallel.forward 9.53% 62.471ms 20.71% 135.708ms 135.708ms 2.016ms 9.57% 127.760ms 127.760ms 1
DataParallel.forward 9.77% 64.044ms 20.70% 135.675ms 135.675ms 2.064ms 9.80% 125.398ms 125.398ms 1
DataParallel.forward 9.43% 61.809ms 20.21% 132.436ms 132.436ms 2.032ms 9.64% 124.054ms 124.054ms 1
DataParallel.forward 9.51% 62.324ms 20.20% 132.358ms 132.358ms 2.064ms 9.80% 122.687ms 122.687ms 1
aten::uniform_ 0.39% 2.566ms 0.39% 2.566ms 2.566ms 1.000us 0.00% 1.000us 1.000us 1
aten::uniform_ 0.39% 2.534ms 0.39% 2.534ms 2.534ms 1.000us 0.00% 1.000us 1.000us 1
aten::uniform_ 0.38% 2.485ms 0.38% 2.485ms 2.485ms 1.000us 0.00% 1.000us 1.000us 1
aten::to 0.00% 7.000us 0.37% 2.423ms 2.423ms 3.000us 0.01% 3.809ms 3.809ms 1
aten::_to_copy 0.00% 29.000us 0.37% 2.416ms 2.416ms 5.000us 0.02% 3.806ms 3.806ms 1
------------------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 655.384ms
Self CUDA time total: 21.069ms
I'm not exactly sure what this means, but to me it looks as if the ROMP implementation is slowed down by too much communication overhead.
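One way to read the CUDA-mode report is to compare the two totals it prints at the end. A quick back-of-the-envelope check, using the numbers from the report above:

```python
# Totals printed at the end of the CUDA-mode autograd profiler report above.
self_cpu_ms = 655.384   # "Self CPU time total"
self_cuda_ms = 21.069   # "Self CUDA time total"

# Fraction of the profiled self time actually spent inside GPU kernels.
gpu_fraction = self_cuda_ms / self_cpu_ms
print(f"{gpu_fraction:.1%}")  # -> 3.2%
```

With only about 3% of the self time spent in CUDA kernels, the GPU sits idle most of the time, which matches the low-utilization observation: per-frame Python and launch overhead, not kernel time, appears to dominate.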
Configuration
GPU: 0
onnx: false
smooth_coeff: 1
temporal_optimize: true
@Arthur151 Do you have any ideas for optimization of ROMP to speed it up? 🙂
By the way, I need to mention that I was not able to test ONNX with CUDA because of https://github.com/Arthur151/ROMP/issues/336
@vivi90 Hi Vivian, yes, I see your question. It happened to my colleague too. ROMP runs at over 25/50 FPS on my 1070Ti/3090Ti, but only at about 20 FPS on my colleague's 3090 server. I haven't found out what causes this problem, but I guess the reason might be some essential acceleration libraries that I installed and my colleague didn't. I haven't determined which library it is. Sorry to say that.
@Arthur151 Hey, did you manage to fix the bug? Whether it is ROMP or BEV, I can only get about 20 FPS on my 4080.
@JunfengLiu1
Please share the following information with us:
- Used operating system
- Used python version
- Used CUDA version
- ROMP & BEV configuration settings
- Profiling reports (https://pytorch.org/docs/stable/bottleneck.html)
@vivi90
- Used operating system: Ubuntu 18.04
- Used python version: 3.7
- Used CUDA version: 11.4
- ROMP & BEV configuration settings:
  I did not modify any parameters in the main file.
  romp: romp --mode=webcam --show, this is the result:
  bev: bev --mode=webcam --show, result:
  The problem is that every now and then (roughly 20% of the time) it drops to 10 FPS (romp as well).
- Profiling reports:
  This is the result of bev --mode=webcam --show; it stopped running after detecting 300 frames.

- Environment:
certifi 2022.12.7
commonmark 0.9.1
cycler 0.11.0
Cython 0.29.33
cython-bbox 0.1.3
filterpy 1.4.5
fonttools 4.38.0
importlib-metadata 4.8.3
kiwisolver 1.4.4
lap 0.4.0
matplotlib 3.5.3
norfair 2.2.0
numpy 1.21.6
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
opencv-python 4.7.0.72
packaging 23.0
Pillow 9.4.0
pip 22.3.1
Pygments 2.14.0
pyparsing 3.0.9
PySocks 1.7.1
python-dateutil 2.8.2
rich 12.6.0
scipy 1.7.3
setuptools 67.6.0
simple-romp 1.0.8
six 1.16.0
torch 1.13.1
typing_extensions 4.5.0
wget 3.2
wheel 0.38.4
zipp 3.15.0
I installed it using pip install --upgrade simple_romp, which by default installed nvidia-cuda-nvrtc-cu11 11.7.99, while my local CUDA is 11.4. Could that be the problem?
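To see which CUDA-related wheels pip actually resolved (as opposed to the system toolkit version), the stdlib importlib.metadata can list the installed versions; the package names below are the ones from the environment list above, adjust as needed:

```python
from importlib.metadata import version, PackageNotFoundError

# Package names taken from the environment list above.
cuda_related = [
    "torch",
    "nvidia-cuda-runtime-cu11",
    "nvidia-cuda-nvrtc-cu11",
    "nvidia-cudnn-cu11",
]

for name in cuda_related:
    try:
        print(f"{name}: {version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")
```

As far as I know, pip-installed torch wheels bundle their own CUDA runtime, so the locally installed toolkit version usually matters less than whether the NVIDIA driver is new enough for the bundled runtime.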
I changed my CUDA to 11.8, but BEV still runs at about 15-20 FPS, and it still keeps dropping to 10 FPS.