Metrabs TensorRT
Hey István,
congrats on these great results and thanks for providing an easy-to-use way to run your models, exceptional work :)
I really like the results I get, and just like everyone else in the issues, I would like to run it in real time.
My approach was to squeeze out some speed-ups using TensorRT and its new tf-trt capability. At least for the resnet-style models, I'd expect a speed-up on the order of 10x. According to Nvidia, the same should hold true for efficientnet-type models.
A tensorflow SavedModel can directly be optimized and converted into a TensorRT model using just a few lines of code:
```python
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverterV2(input_saved_model_dir='models/eff2s_y4_short_sig')
converter.convert()
converter.save('models/eff2s_y4_trt')
```
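For completeness, precision could presumably also be lowered during that conversion for extra speed. The exact kwargs have changed between TF releases, but roughly (untested sketch):

```python
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Untested sketch: request FP16 engines during TF-TRT conversion.
# (Newer TF versions also accept precision_mode directly as a kwarg.)
params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir='models/eff2s_y4_short_sig',
    conversion_params=params)
converter.convert()
converter.save('models/eff2s_y4_trt_fp16')
```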
In order for this conversion to know what to do, a default signature needs to be defined. This can be achieved with the following:
```python
import tensorflow as tf

model_folder = 'models/metrabs_eff2s_y4/'
out_fold = 'models/eff2s_y4_short_sig'
model = tf.saved_model.load(model_folder)

@tf.function()
def my_predict(my_prediction_inputs, **kwargs):
    prediction = model.detect_poses(my_prediction_inputs)
    return {"prediction": prediction['poses3d']}

my_signatures = my_predict.get_concrete_function(
    my_prediction_inputs=tf.TensorSpec([None, None, 3], dtype=tf.dtypes.uint8, name="image"))

tf.saved_model.save(model, out_fold, signatures=my_signatures)
```
(Coincidentally, this might also be a solution to the tensorflow-lite question in the issues? I haven't tried it; it's just a hunch.)
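If someone wants to test that hunch, I'd expect the attempt to look roughly like this (untested; the SELECT_TF_OPS fallback is just a guess at what the unsupported ops would need):

```python
import tensorflow as tf

# Untested sketch: convert the re-signed SavedModel to TF-Lite.
converter = tf.lite.TFLiteConverter.from_saved_model('models/eff2s_y4_short_sig')
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,  # native TFLite kernels
    tf.lite.OpsSet.SELECT_TF_OPS,    # fall back to TF ops where needed
]
tflite_model = converter.convert()
with open('eff2s_y4.tflite', 'wb') as f:
    f.write(tflite_model)
```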
Unfortunately, the conversion segfaults :D I know that this is rather an issue on Nvidia's side, but maybe we can still get this to work. I suspect that the augmentations you perform on the model in the Packaging Model section of your readme might be throwing tf-trt off.
Next, I tried to investigate this issue a little further by looking under the hood of the packaged SavedModel. I used tensorflow's import_pb_to_tensorboard.py and tried to inspect the result in tensorboard.
```
$ python import_pb_to_tensorboard.py --model_dir models/eff2s_y4_short_sig/saved_model.pb --log_dir log
$ tensorboard --logdir log
```
Unfortunately again, tensorboard was not able to display the computation graph, and I suspect the reason is again the usage of tf.functions, but I am not sure.
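For what it's worth, saved_model_cli can at least list the signatures even when tensorboard won't render the graph:

```
$ saved_model_cli show --dir models/eff2s_y4_short_sig --all
$ saved_model_cli show --dir models/eff2s_y4_short_sig --tag_set serve --signature_def serving_default
```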
What I would like to try is to convert one of your trained metrabs models into TensorRT and take a look at the speed-up. Would it be possible for you to share a checkpoint file, or the un-augmented SavedModel as exported here: https://github.com/isarandi/metrabs/blob/master/src/main.py#L242 ? Maybe for metrabs_eff2l_y4, metrabs_eff2s_y4, metrabs_rn152_y4, and metrabs_rn18_y4, to see and compare how backbone and depth affect the inference time?
I ran experiments on an RTX 3090 and timed the following on a video of one person running on a treadmill (1000 frames = 33 s). Just like in issue #25, here are some current timings for varying batch sizes (a rough sketch of my timing loop follows the table):
bbone | det | bs | load | full_time | tfirst (batch) | tmean (batch) | t/sample |
---|---|---|---|---|---|---|---|
eff2l | y4 | 1 | 79.35 | 82.27 | 33221.14 | 149.97 | 149.97 |
eff2l | y4 | 8 | 73.74 | 108.03 | 31397.86 | 279.74 | 34.97 |
eff2l | y4 | 16 | 73.74 | 75.5 | 466.06 | 480.87 | 30.05 |
eff2l | y4 | 32 | 79.21 | 82.64 | 938.29 | 885.90 | 27.68 |
eff2l | y4 | 64 | 79.21 | 93.92 | 1853.03 | 1801.76 | 28.15 |
eff2l | y4 | 128 | 79.21 | 129.32 | 36806.45 | 3976.91 | 31.07 |
eff2l_360 | y4 | 1 | 72.96 | 78.12 | 29512.30 | 148.57 | 148.57 |
eff2l_360 | y4 | 8 | 67.72 | 102.14 | 29740.10 | 293.71 | 36.71 |
eff2l_360 | y4 | 16 | 67.72 | 73.92 | 444.92 | 493.19 | 30.82 |
eff2l_360 | y4 | 32 | 70.2 | 85.18 | 955.17 | 910.69 | 28.46 |
eff2l_360 | y4 | 64 | 70.2 | 90.43 | 1857.24 | 1856.95 | 29.01 |
eff2l_360 | y4 | 128 | 70.2 | 121.25 | 33749.48 | 4055.34 | 31.68 |
eff2m | y4 | 1 | 57.41 | 69.7 | 22034.85 | 129.54 | 129.54 |
eff2m | y4 | 8 | 69.28 | 102.92 | 22930.01 | 345.51 | 43.19 |
eff2m | y4 | 16 | 69.28 | 73.59 | 371.38 | 391.84 | 24.49 |
eff2m | y4 | 32 | 75.41 | 79.76 | 804.21 | 717.96 | 22.44 |
eff2m | y4 | 64 | 75.41 | 90.09 | 1455.46 | 1485.78 | 23.22 |
eff2m | y4 | 128 | 75.41 | 115.63 | 23679.09 | 3401.27 | 26.57 |
eff2s | y4 | 1 | 39.24 | 64.93 | 16928.48 | 119.87 | 119.87 |
eff2s | y4 | 8 | 37.01 | 87.53 | 17234.22 | 208.43 | 26.05 |
eff2s | y4 | 8 | 46.17 | | 39025.55 | 210.69 | 26.34 |
eff2s | y4 | 16 | 37.01 | 70.84 | 444.42 | 341.08 | 21.32 |
eff2s | y4 | 32 | 42.79 | 77.82 | 863.95 | 642.54 | 20.08 |
eff2s | y4 | 64 | 42.79 | 89.46 | 1614.69 | 1343.52 | 20.99 |
eff2s | y4 | 128 | 42.79 | 113.12 | 26511.61 | 2976.25 | 23.25 |
mob3l | y4 | 1 | 28.05 | 78.2 | 21949.89 | 95.03 | 95.03 |
mob3l | y4 | 8 | 32.06 | 86.01 | 19962.75 | 165.93 | 20.74 |
mob3l | y4 | 16 | 32.06 | 81.02 | 244.84 | 260.80 | 16.30 |
mob3l | y4 | 32 | 27.17 | 79.88 | 597.45 | 498.05 | 15.56 |
mob3l | y4 | 64 | 27.17 | 91.02 | 1159.92 | 1111.80 | 17.37 |
mob3l | y4 | 128 | 27.17 | 118.33 | 22417.67 | 2577.09 | 20.13 |
mob3l | y4t | 1 | 21.57 | 47.48 | 9542.37 | 52.28 | 52.28 |
mob3l | y4t | 8 | 21.03 | 72.27 | 9644.42 | 131.21 | 16.40 |
mob3l | y4t | 16 | 21.03 | 66.23 | 190.44 | 169.23 | 10.58 |
mob3l | y4t | 32 | 21.61 | 71.97 | 392.98 | 350.23 | 10.94 |
mob3l | y4t | 64 | 21.61 | 82.26 | 676.51 | 717.93 | 11.22 |
mob3l | y4t | 128 | 21.61 | 90.92 | 11668.84 | 1856.07 | 14.50 |
mob3s | y4 | 1 | 24.16 | | 52083.48 | 10969.11 | 10969.11 |
mob3s | y4 | 8 | 23.7 | 99.66 | 22361.51 | 158.68 | 19.83 |
mob3s | y4 | 16 | 23.7 | 76.41 | 253.11 | 260.37 | 16.27 |
mob3s | y4 | 32 | 23.7 | 81.05 | 510.25 | 498.53 | 15.58 |
mob3s | y4t | 1 | 15.92 | 51.66 | 15179.46 | 50.78 | 50.78 |
mob3s | y4t | 8 | 15.84 | 82.27 | 14242.68 | 124.19 | 15.52 |
mob3s | y4t | 16 | 15.84 | 60.38 | 160.62 | 173.93 | 10.87 |
mob3s | y4t | 32 | 15.74 | 73.8 | 333.28 | 337.94 | 10.56 |
mob3s | y4t | 64 | 15.74 | 83.16 | 606.98 | 718.15 | 11.22 |
mob3s | y4t | 128 | 15.74 | 100.41 | 16325.95 | 1875.85 | 14.66 |
rn101 | y4 | 1 | 45.55 | 69.32 | 23703.08 | 113.88 | 113.88 |
rn101 | y4 | 8 | 42.32 | 94.27 | 23748.03 | 220.17 | 27.52 |
rn101 | y4 | 16 | 42.32 | 73.14 | 315.59 | 354.65 | 22.17 |
rn101 | y4 | 32 | 41.79 | 79.59 | 724.29 | 651.06 | 20.35 |
rn101 | y4 | 64 | 41.79 | 89.47 | 1337.07 | 1321.07 | 20.64 |
rn101 | y4 | 128 | 41.79 | 113.91 | 27209.60 | 3061.64 | 23.92 |
rn152 | y4 | 1 | 56.9 | 77.51 | 30653.02 | 127.11 | 127.11 |
rn152 | y4 | 8 | 54.83 | 106.07 | 30717.84 | 225.73 | 28.22 |
rn152 | y4 | 16 | 54.83 | 75.51 | 361.81 | 376.26 | 23.52 |
rn152 | y4 | 32 | 54.83 | 72.16 | 1059.79 | 694.71 | 21.71 |
rn18 | y4 | 1 | 21.07 | 75.61 | 21613.34 | 90.91 | 90.91 |
rn18 | y4 | 8 | 20.3 | 93.12 | 21005.99 | 157.36 | 19.67 |
rn18 | y4 | 16 | 20.3 | 73.78 | 246.42 | 256.46 | 16.03 |
rn18 | y4 | 32 | 20.24 | 80.91 | 531.65 | 489.85 | 15.31 |
rn18 | y4 | 64 | 20.24 | 91.2 | 1011.25 | 1059.32 | 16.55 |
rn18 | y4 | 128 | 20.24 | 113.03 | 23795.51 | 2535.79 | 19.81 |
rn34 | y4 | 1 | 28.04 | 64.76 | 20548.91 | 97.98 | 97.98 |
rn34 | y4 | 8 | 28.28 | 94.69 | 19514.33 | 171.81 | 21.48 |
rn34 | y4 | 16 | 28.28 | 72.13 | 250.87 | 270.24 | 16.89 |
rn34 | y4 | 32 | 28.37 | 78.21 | 587.16 | 526.36 | 16.45 |
rn34 | y4 | 64 | 28.37 | 93.74 | 1143.86 | 1150.37 | 17.97 |
rn34 | y4 | 128 | 28.37 | 112.3 | 21840.20 | 2724.95 | 21.29 |
rn50 | y4 | 1 | 36.97 | 69.86 | 25330.24 | 99.48 | 99.48 |
rn50 | y4 | 8 | 35.58 | 89.12 | 23896.04 | 179.18 | 22.40 |
rn50 | y4 | 16 | 35.58 | 80.06 | 289.07 | 299.34 | 18.71 |
rn50 | y4 | 32 | 35.58 | 78.47 | 605.76 | 567.36 | 17.73 |
(full_time and load in s, the rest in ms)
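The timing loop itself is roughly the following (heavily simplified; detect_poses_batched as documented in the API readme, the model path and frame shapes are placeholders):

```python
import time
import numpy as np
import tensorflow as tf

model = tf.saved_model.load('models/metrabs_eff2l_y4')    # placeholder path
frames = np.zeros((1000, 720, 1280, 3), dtype=np.uint8)   # stand-in for the decoded video

batch_size = 32
batch_times = []
for start in range(0, len(frames), batch_size):
    batch = tf.constant(frames[start:start + batch_size])
    t0 = time.perf_counter()
    model.detect_poses_batched(batch)
    batch_times.append((time.perf_counter() - t0) * 1000.0)  # ms per batch

print(f'tfirst: {batch_times[0]:.2f} ms, '
      f'tmean: {np.mean(batch_times[1:]):.2f} ms, '
      f't/sample: {np.mean(batch_times[1:]) / batch_size:.2f} ms')
```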
As expected, batching up the computations leads to a significant speed-up. Unfortunately, batching is not feasible for low-latency real-time processing. The fastest model at batchsize=1 was mob3s_y4t at 50 ms. I would like to get below 30 ms, or even 15 ms, using TensorRT.
What do you think? Is this a good avenue to go down, or should I try something other than TensorRT?
Thanks! Tobi
Thanks a lot for the detailed analysis! I will try to come back to this soon, but meanwhile what you can do is perhaps extract out the raw model from the convenient "packaged" one, and try to make that run under TensorRT. If that works, we can look at which part of the surrounding code is the culprit.
So you can try
```python
model = tf.saved_model.load(...)
tf.saved_model.save(model.crop_model, 'somepath')
```
Then try to work with this newly saved model. You can check the interface of this resulting model at the API readme.
Part of the story may be that I saved the full model with options=tf.saved_model.SaveOptions(experimental_custom_gradients=True), or perhaps something with tf.raw_ops.ImageProjectiveTransformV3.
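Concretely, something along these lines (paths are placeholders) lets you check what the extracted crop model actually exposes before handing it to the converter:

```python
import tensorflow as tf

model = tf.saved_model.load('models/metrabs_eff2s_y4')        # placeholder path
tf.saved_model.save(model.crop_model, 'models/eff2s_y4_crop_only')

# Reload and inspect the extracted model: available signatures and
# public attributes/methods, as described in the API readme.
crop_model = tf.saved_model.load('models/eff2s_y4_crop_only')
print(list(crop_model.signatures.keys()))
print([name for name in dir(crop_model) if not name.startswith('_')])
```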
@tobibaum, have you figured out how to reach real-time fps? I run it on Ubuntu 18 + 2080 Ti + TF 2.6 + CUDA 11.2 + cuDNN 8.1 with model.estimate_poses and get about 10 fps. I will now switch to a 3090 GPU and test it on Windows 10 and Ubuntu 18. Were you successful with TensorRT? If so, how?
Can you make a demo.py for this? I have seen the API, but it is too brief.
@isarandi thanks for the great pointers! My current approach is to build the backbone model (effnet-l) using your original training code and then copy over the trained model weights from the SavedModels in your model zoo.
With that, I am able to create an ONNX version of the backbone and compare speed-ups in C++ using tiny-tensorrt.
I loosely follow the Nvidia tutorial to do so:
- load trained model weights:

```python
model = tf.saved_model.load(model_folder)
vars = model.crop_model.variables
```
- create blank effnet model:

```python
from backbones.efficientnet.effnetv2_model import *
import backbones.efficientnet.effnetv2_utils as effnet_util
import tfu

effnet_util.set_batchnorm(effnet_util.BatchNormalization)
tfu.set_data_format('NHWC')
tfu.set_dtype(tf.float16)
mod_met = get_model('efficientnetv2-s', include_top=False, pretrained=False, with_endpoints=False)
```
- copy over trained weights:

```python
new_vars = mod_met.variables
# index the variables of both models by name
var_dict = {v.name: [v, i] for i, v in enumerate(vars)}
var_dict_new = {v.name: [v, i] for i, v in enumerate(new_vars)}
inds = [var_dict[k][1] for k in var_dict_new.keys() if k in var_dict]
print(len(var_dict_new))
print(len(inds))
# report variables that exist in one model but not the other
missing_keys = set(var_dict.keys()) - set(var_dict_new.keys())
rev_missing_keys = set(var_dict_new.keys()) - set(var_dict.keys())
print(missing_keys)
print(rev_missing_keys)
for m in missing_keys:
    d = var_dict[m][0]
    print(d.name, d.shape)
# transfer the matched weights onto the freshly built backbone
pick_vars = [vars[i] for i in inds]
print(len(pick_vars))
mod_met.set_weights(pick_vars)
```
- save model with proper signature:

```python
@tf.function()
def my_predict(my_prediction_inputs, **kwargs):
    # single 256x256x3 input, wrapped in a list (i.e. batch size 1)
    prediction = mod_met([my_prediction_inputs], training=False)
    return {"prediction": prediction}

my_signatures = my_predict.get_concrete_function(
    my_prediction_inputs=tf.TensorSpec([256, 256, 3], dtype=tf.dtypes.float32, name="image")
)
tf.saved_model.save(mod_met, out_fold, signatures=my_signatures)
```
- convert to onnx (install tf2onnx first):

```
$ python -m tf2onnx.convert --saved-model effnet_raw_sig --output effnet.onnx
```

- optimize the runtime plan according to the Nvidia tutorial (see the trtexec sketch after this list)
- run inference in C++ with tiny-tensorrt
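For the plan-optimization step, a minimal trtexec invocation along the lines of the Nvidia tutorial could look like this (the engine file name and the FP16 flag are my choices, not prescribed by the tutorial):

```
$ trtexec --onnx=effnet.onnx --saveEngine=effnet_fp16.plan --fp16
```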
@gao123qiang I will update this issue once I have had some more success with the full pipeline. The above is just my current approach to determine whether the efficientnet backbone can be sped up sufficiently with TensorRT. From my preliminary tests (no guarantees), running just the efficientnet backbone (256x256x3 -> 1x8x8x1280) via the C++ API I get the following timings: tensorflow: 25-30 ms per image; tensorrt: 3-5 ms.
There is of course some more overhead from image preprocessing and output post-processing (plus running the metrabs head), but overall I think this looks promising.
I will update the issue once I get a real-time system running (or abandon the approach). Please feel free to share any thoughts or experiments you perform to speed this whole thing up :)
Thank you, I will test it. Looking forward to your update!
@tobibaum, I have seen the Nvidia tutorial and the steps you provided, mainly: SavedModel --> .pb --> .onnx. Is the mod_met in step 2 keras.models.clone_model(model).crop_model? In step 3, the input shape is 256x256x3; can I change the size?
Hey @gao123qiang,
if I understand correctly, the backbones of the overall models are trained on 256x256 patches. Since they are not fully convolutional nets, they depend on the input being exactly that size. In the packaging section of the readme, the author describes how the metrabs core model is packaged to run on 256x256 images.
Also, you cannot feed your raw images into this; you first need to run a detector to determine the locations of people in your scene (and normalize the input). My comments above are a guideline on how to dig into the metrabs model and investigate potential speed-ups; you will not get reasonable results just by following my steps.
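Just to illustrate the crop idea, not the exact preprocessing the packaged model performs internally (the box values and the use of tf.image.crop_and_resize here are purely illustrative):

```python
import tensorflow as tf

image = tf.zeros([720, 1280, 3], dtype=tf.float32)   # placeholder frame
h, w = 720.0, 1280.0
y1, x1, y2, x2 = 100.0, 400.0, 620.0, 700.0          # placeholder person box from a detector

# Cut out the detected person and resize to the 256x256 input the backbone expects.
crop = tf.image.crop_and_resize(
    image[tf.newaxis],                                # add batch dimension
    boxes=[[y1 / h, x1 / w, y2 / h, x2 / w]],         # normalized [y1, x1, y2, x2]
    box_indices=[0],
    crop_size=(256, 256))
print(crop.shape)                                     # (1, 256, 256, 3)
```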
@tobibaum, your approach looks very promising. Have you made any progress with it?
I finally have a working version of my approach. I had to perform some major surgery on the provided models, but I confirmed that the results stay approximately the same, just at higher speeds. (I use a 3rd-party YOLO, so the crops and everything downstream differ slightly.) Here's what I did:
Split the saved_model into:
- detection (accelerate with TensorRT)
- backbone (accelerate with TensorRT)
- metrabs head (run as is)
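The first level of that split looks roughly like this (crop_model was mentioned by the author above; detector is my guess at the attribute name of the bundled person detector, and the path is a placeholder):

```python
import tensorflow as tf

model = tf.saved_model.load('models/metrabs_eff2s_y4')        # placeholder path

# crop_model is mentioned earlier in this thread; detector is my assumption
# about the attribute name of the bundled person detector.
tf.saved_model.save(model.detector, 'models/split/detector')
tf.saved_model.save(model.crop_model, 'models/split/crop_model')
# Splitting crop_model further into backbone and metrabs head required manual surgery.
```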
I was then able to convert the first two into TensorRT models. The metrabs head has some operations in it that TensorRT does not like, but since it is just the last layer and some function wrappers, it runs very fast in tensorflow using the C API. I then wrapped these three models in C++ code and get the following timings on an RTX 2070:
model | fps |
---|---|
efficientnet-l | 37 |
efficientnet-s | 50 |
resnet50 | 58 |
resnet152 | 45 |
check it out: https://github.com/tobibaum/metrabs_trt
@isarandi it would be great if you could have a look over my approach and check whether I made any breaking mistakes. Thanks!!
good job
@tobibaum Thank you very much for your great how-to!
With your help, I was able to generate the ONNX and plan file of the backbone. I have not yet done the last step (Compile Your C++ Version). I would prefer a Python version, because otherwise I have to build a cpp extension for it. For my Python version I used the engine and inference code from the Nvidia tutorial. But now I get the error:
```
[01/06/2022-14:52:12] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 7518, GPU 9372 (MiB)
[01/06/2022-14:52:12] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 7518, GPU 9382 (MiB)
[01/06/2022-14:52:12] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +27, now: CPU 0, GPU 248 (MiB)
[01/06/2022-14:52:12] [TRT] [E] 1: Unexpected exception
```
Cuda, TensorRT and all other needed libraries are natively installed on my system. The output data consists entirely of zeros, and there is no error in the Python code. Do you have experience with that? Does this approach also work if I set a dynamic input size in the signature? In my use case I have to change the batch size dynamically at runtime. Sorry for these questions, but TensorRT is completely new to me.
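The kind of dynamic signature I have in mind would look roughly like this (untested; mod_met and out_fold refer to the weight-copy steps earlier in this thread):

```python
import tensorflow as tf

# Untested sketch: leave the batch dimension as None so the batch size can
# vary at runtime. The TensorRT engine then also has to be built for dynamic
# shapes (e.g. trtexec --minShapes/--optShapes/--maxShapes).
# Assumes mod_met and out_fold from the earlier steps are in scope.
@tf.function()
def my_predict(my_prediction_inputs, **kwargs):
    prediction = mod_met(my_prediction_inputs, training=False)
    return {"prediction": prediction}

my_signatures = my_predict.get_concrete_function(
    my_prediction_inputs=tf.TensorSpec([None, 256, 256, 3],
                                       dtype=tf.dtypes.float32, name="image"))
tf.saved_model.save(mod_met, out_fold, signatures=my_signatures)
```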
In the next step, I will try your c++ version. Maybe that works for me.
Hey @Basti110,
unfortunately I cannot tell what the error might be here. Could you try to run the Nvidia inference engine with their models, to pinpoint whether the problem is in your setup or in the compiled model?
Cheers!
Hey,
I think it was only a problem in my Python environment. The C++ version with tiny-tensorrt works! Thank you.
Thanks for testing my implementation :)