Inference time is slower than expected

Open junbangliang opened this issue 1 year ago • 19 comments

Hi,

Thanks for sharing the work. When I run the vitl example on an A100 GPU, the inference time settles at around 120ms rather than the 13ms stated in the repo. Is there a reason for this? I have provided the experiment I ran below.

Thanks!

import cv2
import numpy as np
import os
import torch
import torch.nn.functional as F
from torchvision.transforms import Compose

from depth_anything.dpt import DepthAnything
from depth_anything.util.transform import Resize, NormalizeImage, PrepareForNet

import matplotlib.pyplot as plt

if __name__ == '__main__':

    os.environ['CUDA_VISIBLE_DEVICES'] = '0'
    
    DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

    gpu_name = torch.cuda.get_device_name(torch.cuda.current_device())
    print(f"GPU being used: {gpu_name}")
    
    encoder = 'vitl'
    depth_anything = DepthAnything.from_pretrained('LiheYoung/depth_anything_{}14'.format(encoder)).to(DEVICE).eval()
    
    total_params = sum(param.numel() for param in depth_anything.parameters())
    print('Total parameters: {:.2f}M'.format(total_params / 1e6))
    
    transform = Compose([
        Resize(
            width=518,
            height=518,
            resize_target=False,
            keep_aspect_ratio=True,
            ensure_multiple_of=14,
            resize_method='lower_bound',
            image_interpolation_method=cv2.INTER_CUBIC,
        ),
        NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        PrepareForNet(),
    ])

    filename = "assets/examples/demo1.png"

    raw_image = cv2.imread(filename)
    image = cv2.cvtColor(raw_image, cv2.COLOR_BGR2RGB) / 255.0
    
    h, w = image.shape[:2]
    
    image = transform({'image': image})['image']
    image = torch.from_numpy(image).unsqueeze(0).to(DEVICE)
    
    print(f"image shape: {image.shape}")

    with torch.no_grad():
        import time
        for i in range(1000):
            start = time.perf_counter()
            depth = depth_anything(image)
            print(f"inference time is: {time.perf_counter() - start}s")
GPU being used: NVIDIA A100-SXM4-80GB
Total parameters: 335.32M
image shape: torch.Size([1, 3, 518, 784])
inference time is: 3.4120892197825015s
inference time is: 0.014787798281759024s
inference time is: 0.01355740800499916s
inference time is: 0.10093897487968206s
inference time is: 0.12020917888730764s
inference time is: 0.11985550913959742s
inference time is: 0.12007139809429646s
inference time is: 0.1200293991714716s
inference time is: 0.12007084907963872s
inference time is: 0.12004875903949142s
inference time is: 0.12011446803808212s

junbangliang avatar Jan 27 '24 20:01 junbangliang

Hi, I tried your code and the output is similar to yours. Strangely, as shown in your log, the second and third loops take only 0.0147s and 0.0135s respectively, which is close to our reported inference time (yours are still a little higher because our inference time is measured at 518x518 resolution). Honestly, I also have no idea why the inference time suddenly becomes 10x larger from the fourth loop onwards.

Intriguingly, if you insert the time counter into our run.py, you will find the obtained inference time is normal and always consistent with our reported numbers. I hope you can give it a try.

LiheYoung avatar Jan 28 '24 04:01 LiheYoung

I think testing speed this way (with perf_counter calls before and after calling the model) gives inconsistent results because CUDA executes asynchronously. You need to force the GPU to sync up before reading the timer, or else the print can happen before the GPU is done (which is likely where the 0.014/0.013s times are coming from).

You can force a sync by moving the data back to the CPU (e.g. using depth_anything(image).cpu()) or by syncing explicitly:

start = time.perf_counter()
depth = depth_anything(image)
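# wait for the queued GPU work to finish before reading the timer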
torch.cuda.synchronize()
print(f"inference time is: {time.perf_counter() - start}s")

heyoeyo avatar Jan 28 '24 15:01 heyoeyo

With GPU synchronization implemented, the script now gives the following inference times. Is there a way to speed up inference for a large number of images?

GPU being used: NVIDIA A100-SXM4-80GB
Total parameters: 335.32M
image shape: torch.Size([1, 3, 518, 784])
inference time is: 3.3442822340875864s
inference time is: 0.12628611270338297s
inference time is: 0.12091294582933187s
inference time is: 0.11994686676189303s
inference time is: 0.12029408616945148s
inference time is: 0.1201697769574821s
inference time is: 0.12012322712689638s
inference time is: 0.12011239631101489s
inference time is: 0.12021539593115449s
inference time is: 0.12010658718645573s
inference time is: 0.12012855615466833s

junbangliang avatar Jan 28 '24 18:01 junbangliang

Is there a way to speed up inference for a large number of images?

The usual speed-up for lots of images is to batch them together. You can do this with image_batch = torch.cat((image1, image2, image3, ... etc)). This comes at the cost of higher VRAM usage, but it should reduce the amount of back-and-forth between the CPU and GPU.

You might also get a small speed-up by changing the torch.no_grad() part to use torch.inference_mode() instead, if you're using a newer version of pytorch.

You can also try the torch.channels_last memory format, though whether this helps will depend on the model, and it very slightly alters the results (from what I've seen). It's set the same way as the device: data.to(device, memory_format=torch.channels_last)

Lastly, you might get a big speed-up by using torch.float16, usually at the expense of slightly worse results (and in some cases you can get NaN/inf results that wouldn't occur with the default float32 type; torch.bfloat16 may work better in that case). This is also set like the device: data.to(device, dtype=torch.float16)
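
Putting those together, a minimal sketch (not from the repo; it assumes depth_anything and a list of preprocessed 1x3xHxW CPU tensors named image_list already exist):

import torch

# Assumed to exist already: depth_anything (the model) and image_list (preprocessed CPU tensors)
device = "cuda"

# Batching: stack the images into a single (N, 3, H, W) tensor (uses more VRAM)
image_batch = torch.cat(image_list)

# float16 + channels_last are applied the same way the device is set
depth_anything = depth_anything.to(device, dtype=torch.float16, memory_format=torch.channels_last)
image_batch = image_batch.to(device, dtype=torch.float16, memory_format=torch.channels_last)

# inference_mode is a slightly stricter/faster alternative to no_grad on newer pytorch
with torch.inference_mode():
    depth = depth_anything(image_batch)
    torch.cuda.synchronize()  # wait for the GPU to finish before timing or reading results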

heyoeyo avatar Jan 28 '24 23:01 heyoeyo

@jlia904 Even after correcting the code snippet with torch.cuda.synchronize(), your inference speed settles at around 120ms, which is 10 times slower than reported, albeit at a higher resolution. Did you try a resolution of 512x512 to see if you could reproduce the numbers reported by the authors?

kishore-greddy avatar Feb 06 '24 16:02 kishore-greddy

@kishore-greddy Yes, I did try 512x512 resolution. The inference speed is still over 100ms.

junbangliang avatar Feb 06 '24 16:02 junbangliang

@LiheYoung As reported by @jlia904, I also tried inference at 512x512 resolution on a Tesla V100-DGXS-32GB, and my inference time was around 130ms, which is nowhere near the 20ms you reported. Could you recheck your numbers or share the code snippet you use to get the 20ms inference time on a V100?

kishore-greddy avatar Feb 06 '24 16:02 kishore-greddy

@jlia904 Thanks for the reply. Do you know a possible reason for it? Or do you think the reported numbers are wrong?

kishore-greddy avatar Feb 06 '24 16:02 kishore-greddy

@kishore-greddy I tried it on another A100 machine and can now get down to about 70ms. Results vary between machines, but this is still not close to the numbers reported by the authors.

GPU being used: NVIDIA A100-SXM4-80GB
Total parameters: 335.32M
image shape: torch.Size([1, 3, 518, 518])
inference time is: 1.75464620799994s
inference time is: 0.08826967400000285s
inference time is: 0.08717626499992548s
inference time is: 0.07318027000019356s
inference time is: 0.07299402100011321s
inference time is: 0.07296579000012571s
inference time is: 0.07296102099985546s
inference time is: 0.07299358000000211s
inference time is: 0.07297067099989363s
inference time is: 0.07296242999996139s
inference time is: 0.07297045099994648s
inference time is: 0.07316351999998005s
inference time is: 0.07317171099998632s
inference time is: 0.07321707999994942s
inference time is: 0.07319737000011628s
inference time is: 0.07319433999987268s
inference time is: 0.07316986000000725s
inference time is: 0.07317408099993372s
inference time is: 0.07317609000006087s

junbangliang avatar Feb 06 '24 16:02 junbangliang

Do you know the possible reason for it?

It could just be that they've left out info about how they're running the model. If they use float16, that can knock ~50% off the time and batching can reduce that another ~25% (by comparison, inference_mode and channels_last memory formatting don't seem to do much for these models). Using xFormers knocks another 25% when using float16. With these changes, I get the following numbers on a 3090 @ 518x518:

vit-small:
GPU being used: NVIDIA GeForce RTX 3090
Total parameters: 24.79M
image dtype: torch.float16
image shape: torch.Size([32, 3, 518, 518])
batch size: 32
Per-image time: 7.4 ms
Per-image time: 3.3 ms
Per-image time: 3.2 ms
Per-image time: 3.2 ms
Per-image time: 3.0 ms
Per-image time: 3.1 ms
Per-image time: 3.2 ms
vit-base:
GPU being used: NVIDIA GeForce RTX 3090
Total parameters: 97.47M
image dtype: torch.float16
image shape: torch.Size([32, 3, 518, 518])
batch size: 32
Per-image time: 11.5 ms
Per-image time: 7.7 ms
Per-image time: 7.7 ms
Per-image time: 7.6 ms
Per-image time: 7.7 ms
Per-image time: 7.7 ms
Per-image time: 7.6 ms
vit-large:
GPU being used: NVIDIA GeForce RTX 3090
Total parameters: 335.32M
image dtype: torch.float16
image shape: torch.Size([32, 3, 518, 518])
batch size: 32
Per-image time: 25.5 ms
Per-image time: 23.2 ms
Per-image time: 23.1 ms
Per-image time: 23.0 ms
Per-image time: 23.0 ms
Per-image time: 23.0 ms
Per-image time: 23.1 ms

For reference, without these changes, vit-large takes around 94ms per image.

I'm not familiar with the A100/V100 and where they stand vs. the 3090, but these numbers seem reasonable compared to the reported 4090 numbers, assuming the tests were done with float16 or bfloat16.

heyoeyo avatar Feb 06 '24 22:02 heyoeyo

@heyoeyo Thanks for your reply. I will try it out on my side as well with the change in precision and update the results.

kishore-greddy avatar Feb 08 '24 10:02 kishore-greddy

@heyoeyo Could you provide a bit more detail or share the modified code on how to add batching? I'm not super familiar with all the torch stuff and how you would fully implement these changes 😅

Bolt-Scripts avatar Jun 17 '24 17:06 Bolt-Scripts

Sure, here's a modified version of the code @jlia904 posted originally. The biggest change is just using a device_config dictionary in place of the original DEVICE value. This lets you set the data type (e.g. float16) when moving data to the gpu.

import time

import cv2
import torch
from torchvision.transforms import Compose

from depth_anything.dpt import DepthAnything
from depth_anything.util.transform import Resize, NormalizeImage, PrepareForNet

# Settings
encoder = "vits"
use_channels_last = False
use_batching_example = False
use_float16 = False

# Example of passing 1 or 4 images to the model
files_list = ["assets/examples/demo1.png"] # Batch of 1
if use_batching_example:
    files_list = [
        "assets/examples/demo1.png",
        "assets/examples/demo2.png",
        "assets/examples/demo3.png",
        "assets/examples/demo4.png",
    ]

# Set up device/data type for image & model weights
device_config = {
    "device": "cuda",
    "dtype": torch.float16 if use_float16 else torch.float32,
    "memory_format": torch.channels_last if use_channels_last else None
}

depth_anything = DepthAnything.from_pretrained('LiheYoung/depth_anything_{}14'.format(encoder)).eval()
depth_anything.to(**device_config)
transform = Compose([
    Resize(
        width=518,
        height=518,
        resize_target=False,
        keep_aspect_ratio=False,
        ensure_multiple_of=14,
        resize_method='lower_bound',
        image_interpolation_method=cv2.INTER_CUBIC,
    ),
    NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    PrepareForNet(),
])

# Loading & pre-processing image data
image_list = []
for filename in files_list:
    raw_image = cv2.imread(filename)
    image = cv2.cvtColor(raw_image, cv2.COLOR_BGR2RGB) / 255.0
    image = transform({'image': image})['image']
    image = torch.from_numpy(image).unsqueeze(0)
    image_list.append(image)
image_batch = torch.cat(image_list).to(**device_config)

print("GPU:", torch.cuda.get_device_name(torch.cuda.current_device()))
print("Image batch shape:", image_batch.shape)
print("Device Config:", device_config)

# Computing depth results
batch_size = image_batch.shape[0]
with torch.no_grad():
    for i in range(24):
        start = time.perf_counter()
        depth = depth_anything(image_batch).cpu()
        total_time_ms = 1000 * (time.perf_counter() - start)
        print("Per-image time:", round(total_time_ms/batch_size, 1), "ms")

# For feedback, check if xformers is installed from: pip install xformers
# (model uses it automatically if available, it helps with float16 speed/memory use)
using_xformers = False
try:
    import xformers
    using_xformers = True
except ImportError:
    pass

# Some feedback at the end, for reference
print("Using channels last format:", use_channels_last)
print("Using batching:", use_batching_example)
print("Using float 16:", use_float16)
print("Using xformers:", using_xformers)

The script should be placed in the root of the Depth-Anything folder so that all the import/image paths work properly. You can adjust the settings at the top to toggle the options on and off and see what effect they have on running speed. The batching in this case only uses 1 image (no batching) or 4 images; you get more improvement with higher batch sizes, but you need a use case where you'd have a bunch of images ready at once to take advantage of this.

Also, shameless plug :p, but I have a repo that has some of these speed-ups built-in and includes a video script in case that's of any use.

heyoeyo avatar Jun 17 '24 20:06 heyoeyo

@heyoeyo Thanks, much appreciated 😋 Currently I'm working on a system for streaming depth frames from a video for realtime use actually. I think it'd maybe be helpful to be able to batch the frames ahead of the video to increase performance. Because as it is I end up with low gpu utilization and can only really process a bit over 15fps, almost regardless of model and resolution. It feels as if the gpu just spits stuff out faster than it can be supplied with new data. Using fp16 and stuff doesn't really make a difference, probably because the bottleneck is elsewhere, my guess being just excessive gpu sync from all the stuff going around. Which is why batching seems like it'd help a lot here if that's right.

But I don't really know enough about all this torch malarky to really understand like what operations might cause issues or how to speed this up. I'm unclear on where certain operations even take place, like it looks like the image transformations and such happen on cpu since the .to(device) stuff happens afterwards, but maybe I'm wrong about that. But point being idk if stuff like that is sucking away processing time or what, or if certain cpu side things could be done on separate threads to have data always prepared for the gpu. idk man. So if you have any more tips on how to minimize downtime and increase gpu occupancy, that'd be great 😅

Bolt-Scripts avatar Jun 17 '24 23:06 Bolt-Scripts

depth frames from a video for realtime use

This can be a tricky use case to optimize I think. While batching can help, it directly opposes the requirement of it being realtime, since you'd need to wait for frames to form each batch. For example, if you form a batch of 32 (which seems to give a decent bump in performance), then that would lead to a ~1 second delay (assuming 30fps) in processing the first frame of that batch. So there may be a limit to the benefit of batching, depending on how strict your realtime requirement is.

can only really process a bit over 15fps, almost regardless of model and resolution

This seems surprising to me! There should be a very noticeable difference between the vit-small and vit-large processing speeds. I'd assume this means that the bottleneck may be reading frames from the video (this can be very slow for certain codecs, like h265), or that it's a result of how the time is being measured? It's hard to say.

Using fp16 and stuff doesn't really make a difference

This is also surprising. Just to be sure, that code I posted doesn't use float16 by default (the use_float16 variable needs to be set to True), so in case you ran it as-is and didn't see any difference, that might be why?

all this torch malarky to really understand like what operations might cause issues or how to speed this up

Ya, the asynchronous stuff is confusing in pytorch, since it's not explicit (like async/await in javascript, for example). When you need to move something to the cpu, there's actually a non_blocking argument that can be passed in to delay the sync, though its behavior can be confusing! In the code above, instead of moving the depth data to the cpu using .cpu(), you can do something like: depth = depth_anything(image_batch).to("cpu", non_blocking=True) This seems to let the code 'run ahead' until the next line where the depth variable is actually needed. It doesn't do much for the code above, but may be helpful for your use case, since it's a bit like using multiple threads.
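
As a rough sketch of that 'run ahead' pattern (prepare_next_frame is a hypothetical placeholder for whatever CPU-side work you have; depth_anything and image_batch are as in the script above):

import time
import torch

with torch.inference_mode():
    start = time.perf_counter()
    # Request the copy back to the CPU without waiting for it
    depth = depth_anything(image_batch).to("cpu", non_blocking=True)

    # CPU-side work that doesn't need `depth` can run here while the GPU finishes
    prepare_next_frame()  # hypothetical placeholder

    # Make sure the GPU work (and the copy) has finished before using `depth`
    torch.cuda.synchronize()
    print("time:", time.perf_counter() - start)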

it looks like the image transformations and such happen on cpu

Yes, that's correct. There was another post about moving these operations to the GPU (issue #173), and that poster said they got some noticeable improvements. They posted a link to their updated code, so it may be worth checking out.
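
Not the code from that issue, but as a rough illustration, the preprocessing can be written with plain torch ops so it runs on the GPU (this sketch ignores the aspect-ratio / multiple-of-14 handling done by the repo's Resize transform):

import torch
import torch.nn.functional as F

_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def preprocess_on_gpu(bgr_frame, device="cuda", size=518):
    # HxWx3 uint8 BGR frame (e.g. from cv2) -> 1x3xSxS float tensor on the GPU
    t = torch.from_numpy(bgr_frame).to(device)
    t = t.flip(-1)                                    # BGR -> RGB
    t = t.permute(2, 0, 1).unsqueeze(0).float() / 255.0
    t = F.interpolate(t, (size, size), mode="bicubic", align_corners=False)
    return (t - _MEAN.to(device)) / _STD.to(device)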

any more tips on how to minimize downtime and increase gpu occupancy

I'd recommend double-checking that there's not some issue with reading frames quickly enough (e.g. if the frames are being read at 15fps, then no amount of GPU optimization can produce output faster than 15fps). I usually just drop perf_counter() calls all over the place to figure out what's taking the most time (though this can be tricky with the pytorch/cuda stuff). Otherwise, there are also other runtimes (e.g. TensorRT) which seem to make better use of the GPU, so that's worth considering (the image transformation issue is worth checking out for this too, since that user seemed to be using TensorRT).
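
For the perf_counter approach, a small timing helper that syncs before and after the block being measured (just a sketch, names made up) avoids the async-related skew discussed earlier:

import time
from contextlib import contextmanager

import torch

@contextmanager
def cuda_timer(label):
    # Sync so the timer only measures the work inside the `with` block
    torch.cuda.synchronize()
    start = time.perf_counter()
    yield
    torch.cuda.synchronize()
    print(f"{label}: {1000 * (time.perf_counter() - start):.1f} ms")

# Usage (with the model/image names from the earlier scripts):
# with cuda_timer("inference"):
#     depth = depth_anything(image_batch)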

heyoeyo avatar Jun 18 '24 15:06 heyoeyo