
Is there any way to reduce the GPU memory usage and enhance the inference speed?

Open JinraeKim opened this issue 3 years ago • 6 comments

M-LSD's pred_lines takes longer than I expected, running at about 6 Hz (including other processing; M-LSD-tiny only seems to reach about 10 Hz).
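For reference, a minimal way to measure an end-to-end rate like the figures above is to time repeated calls. This is just a sketch with a placeholder function standing in for a real `pred_lines(image)` call; the names here are illustrative, not from the repo:

```python
import time

def measure_hz(fn, n_iters=50):
    """Return average calls per second over n_iters invocations."""
    start = time.perf_counter()
    for _ in range(n_iters):
        fn()
    elapsed = time.perf_counter() - start
    return n_iters / elapsed

# Placeholder workload standing in for a real pred_lines(image) call.
def detect():
    sum(range(10_000))  # dummy work

print(f"{measure_hz(detect):.1f} Hz")
```

Warming up the GPU with a few untimed calls first, and synchronizing before stopping the clock, matters for real CUDA measurements.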

And it takes about 2 GB of GPU memory.

Is there a way to reduce the GPU memory usage and enhance the inference speed? (including TensorRT, etc.)
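As a rough back-of-the-envelope intuition for why lower precision helps with memory (illustrative only, not a measurement of M-LSD itself): weights stored at FP16 take half the bytes of FP32.

```python
def model_weight_mb(num_params, bytes_per_param):
    """Approximate weight storage in megabytes."""
    return num_params * bytes_per_param / (1024 ** 2)

# Hypothetical 1.5M-parameter model (illustrative, not M-LSD's real size).
params = 1_500_000
print(f"FP32: {model_weight_mb(params, 4):.2f} MB")  # 4 bytes per parameter
print(f"FP16: {model_weight_mb(params, 2):.2f} MB")  # 2 bytes per parameter
```

In practice, activations, workspace buffers, and framework overhead usually dominate the 2 GB figure, which is why runtimes like TensorRT, which plan memory ahead of time, can cut allocation overhead beyond the simple weight halving.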

Please give me some advice, as I'm not an expert in this.

Thanks!

JinraeKim avatar Sep 09 '22 00:09 JinraeKim

You can try the TensorRT version by @rhysdg , https://github.com/lhwcv/mlsd_pytorch#benchmarks

lhwcv avatar Sep 09 '22 00:09 lhwcv

> You can try the TensorRT version by @rhysdg , https://github.com/lhwcv/mlsd_pytorch#benchmarks

Thanks for sharing the link.

I'm not familiar with TensorRT. Would it reduce memory usage and improve inference speed at the same time?

JinraeKim avatar Sep 09 '22 02:09 JinraeKim

@JinraeKim @lhwcv Apologies for the late reply, busy times! For sure, the main aim with TensorRT is to reduce latency, and therefore increase inference speed pretty significantly, with minimal reduction in quality at FP16. Given a successful conversion, you should also see a significant reduction in memory allocation overhead.

It's worth bearing in mind that the setup I have here was developed for Jetson-series devices, although my understanding is that it plays nice with Nvidia's NGC PyTorch Docker container. I'm hoping to start bringing in a TensorRT Python API / PyCUDA version shortly that should work across a wider range of devices. What were you hoping to deploy with, @JinraeKim?

rhysdg avatar Sep 12 '22 12:09 rhysdg

@rhysdg Thank you for the detailed explanation! Yeah, I'm looking at deployment on Nvidia Jetson, and also on my personal laptops for practice.

That gave me really nice insight! Thank you again!

JinraeKim avatar Sep 15 '22 06:09 JinraeKim

@JinraeKim I'm working on a more robust tool over at trt-devel that adds the ability to convert custom-trained models with three-channel inputs, as per the training code, and drops the result into a folder named according to the experiment. This will eventually become a PR, but I'm hoping to do a little more testing with the ONNX conversion when I get a chance. For now the tool works if you need it for a custom training run, and I can confirm that the results are fantastic with @lhwcv's training script plus some added aggressive pixel-level augmentations!

After that's done, I'll work on a straight TensorRT conversion tool with wider device support, and also post-training quantization for the ONNX representation!
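For anyone curious what post-training quantization does at a high level, here is a minimal symmetric INT8 sketch in plain Python. This is purely illustrative; real PTQ toolchains such as TensorRT's calibrate scales per layer from representative data and handle activations as well as weights:

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to int8 range [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map int8 values back to approximate floats."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# restored approximates weights, with error bounded by half the scale step
```

Storing each value as one byte instead of four is where the memory and bandwidth savings come from, at the cost of a small, bounded rounding error.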

rhysdg avatar Sep 21 '22 10:09 rhysdg

Ah yes, and I have yet to update the documentation accordingly, but adding the --custom experiment.pth arg, with your checkpoint dropped into ./models/experiment.pth, will result in a sped-up representation at ./models/experiment/mlsd_large/tiny__512_trt_fp16.pth

rhysdg avatar Sep 21 '22 10:09 rhysdg