
Reducing the model size / making it lighter than the original ViT-S and eventually reducing the FLOPs

Open abhishek0696 opened this issue 10 months ago • 4 comments

Hi, I am getting decent enough results with the ViT-S architecture for metric depth, which has around 22-25 million parameters. I want to make the model even faster/lighter; I could imagine achieving this via an architecture change/reduction, swapping in another encoder, etc. Please suggest if and how I can make the model lighter, reduce the FLOPs, and eventually make inference faster. Thanks in advance!

abhishek0696 avatar Apr 01 '24 20:04 abhishek0696

You can use FP16 to reduce the size and make it faster. I am investigating INT8 quantization or QAT, as mentioned here (https://github.com/LiheYoung/Depth-Anything/issues/137), but it looks like the authors have not released the training code, so QAT may be challenging.
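
As a rough illustration of both ideas in plain PyTorch (the loader name is a placeholder, not the actual Depth-Anything API, and the input size is just an example):

```python
import torch

# Hypothetical loader; assume `model` is an already-loaded Depth-Anything ViT-S module.
model = load_depth_anything_vits()  # placeholder, not the real API

# FP16: cast the weights and feed FP16 inputs on GPU.
model_fp16 = model.half().cuda().eval()
x = torch.randn(1, 3, 518, 518, dtype=torch.float16, device="cuda")  # example input size
with torch.no_grad():
    depth = model_fp16(x)

# Post-training dynamic INT8 quantization (CPU) as a quick baseline;
# proper QAT would need the training code, which is not released.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model.cpu().float(), {torch.nn.Linear}, dtype=torch.qint8
)
```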

TouqeerAhmad avatar Apr 01 '24 20:04 TouqeerAhmad

@TouqeerAhmad Oh, do you mean just converting the input tensor from FP32 to FP16 when feeding it to the model during training and inference, or are you also suggesting changes to the code? If so, what changes are required?

abhishek0696 avatar Apr 01 '24 21:04 abhishek0696

No, I mean exporting the model with FP16 precision and running inference in FP16 as well, e.g., you can use ONNX to export the model and run inference with ONNX Runtime or TensorRT. There is no significant difference in the output, but it reduces the FLOPs by half.
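
A minimal sketch of that flow, assuming a Depth-Anything module is already loaded in PyTorch (the file name, tensor names, and input size are illustrative, not official export settings):

```python
import numpy as np
import onnxruntime as ort
import torch

model = model.half().cuda().eval()  # assumed: a loaded Depth-Anything module
dummy = torch.randn(1, 3, 518, 518, dtype=torch.float16, device="cuda")

# Export once to ONNX with FP16 weights baked in.
torch.onnx.export(
    model, dummy, "depth_anything_fp16.onnx",
    input_names=["image"], output_names=["depth"], opset_version=17,
)

# Run inference with ONNX Runtime on GPU (the TensorRT EP is another option).
sess = ort.InferenceSession(
    "depth_anything_fp16.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
img = np.random.rand(1, 3, 518, 518).astype(np.float16)
depth = sess.run(["depth"], {"image": img})[0]
```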

TouqeerAhmad avatar Apr 01 '24 21:04 TouqeerAhmad

I have used TensorRT to convert the large model to FP16, and it works great. I was unable to convert the giant model to FP16, though. I would love to try making an 8-bit TensorRT version of vitl or vitg; I just haven't had the time to try yet.

On a 4090 I get 13 fps at 800x800 using vitl. Not bad.
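
For reference, a sketch of building an FP16 TensorRT engine from the ONNX export above via the TensorRT Python API (file names are placeholders, the exact flags vary slightly across TensorRT versions, and an INT8 build would additionally need a calibrator):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Parse the ONNX export (placeholder file name).
parser = trt.OnnxParser(network, logger)
with open("depth_anything_fp16.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

# Allow FP16 kernels and serialize the engine.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
engine_bytes = builder.build_serialized_network(network, config)
with open("depth_anything_fp16.engine", "wb") as f:
    f.write(engine_bytes)
```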

Redmond-AI avatar Sep 02 '24 19:09 Redmond-AI