How to improve speed on RTX 3090
Hello, thanks for your excellent work!
In my own test, the model takes 934.39 ms to predict 13 images on an RTX 3090 with torch 2.5.1 and CUDA 11.8. Is that normal?
In addition, I don't know much about FlashAttention, but I noticed that there isn't any package related to it in your pyproject.toml or requirements.txt. I manually installed the flash-attn v2 package, but it brought no improvement to the runtime. Could you give me some suggestions?
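For reference, this is roughly how I measure latency (a simplified sketch: the `timed` helper and the CPU stand-in are illustrative, not from the repo; for a CUDA model you should call `torch.cuda.synchronize()` before each clock read, otherwise you only time kernel launches, not the actual compute):

```python
import time

def timed(fn, *args, warmup=3, iters=10):
    """Average wall-clock latency of fn over several iterations.
    NOTE: for CUDA models, call torch.cuda.synchronize() right before
    each perf_counter() read; GPU kernels run asynchronously."""
    for _ in range(warmup):       # warm-up runs (JIT, cache, cuDNN autotune)
        fn(*args)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters

# CPU stand-in for the real model call:
ms = timed(lambda: sum(range(10000))) * 1000
print(f"{ms:.3f} ms per call")
```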
Hi,
Starting from PyTorch 2.2, the built-in scaled_dot_product_attention function has integrated support for FlashAttention v2, so there's no need to install FlashAttention v2 manually.
If you're looking for faster inference, you can try manually installing FlashAttention v3, which can often yield up to a 2× speed-up in compatible settings. However, I don't have access to an RTX 3090, so I can't confirm the exact gains on that GPU.
https://github.com/facebookresearch/vggt/blob/c4b5da2d8592a33d52fb6c93af333ddf35b5bcb9/vggt/layers/attention.py#L61
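To be clear about what the fused kernels buy you: SDPA and FlashAttention compute exactly the same math as plain attention, they just avoid materializing the full L×L attention matrix. A minimal NumPy sketch of that math (illustrative only, not the repo's implementation):

```python
import numpy as np

def sdpa(q, k, v):
    """Plain scaled dot-product attention: softmax(q k^T / sqrt(d)) v.
    Fused kernels (torch SDPA, FlashAttention) compute the same result
    without building the full (L, L) score matrix in memory."""
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)   # (..., L, L)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)             # softmax over keys
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4, 8))   # (batch, seq, head_dim)
k = rng.standard_normal((2, 4, 8))
v = rng.standard_normal((2, 4, 8))
out = sdpa(q, k, v)
print(out.shape)   # (2, 4, 8)
```

Because the result is identical, swapping backends changes speed and memory use, not the predictions.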
Hey @DX3906G can you please share what was your image size? And for inference you used the code in one of the demo scripts right? Or something else? Because for me it takes significantly longer, usually in seconds.
Hi @W-OK-E @jytime , I directly use load_and_preprocess_images from vggt.utils.load_fn to preprocess my images, but model.aggregator takes much longer on an A800 than the reported H100 evaluation. Can I adjust the target size in load_and_preprocess_images? I want to obtain camera-extrinsics predictions, but when I reduce the size from 518 to 224, the error in the predictions becomes fairly large.
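My assumption (since 518 = 37 × 14, matching the DINOv2 patch size) is that both sides should be multiples of 14, so intermediate sizes between 224 and 518 should also be valid. A small sketch of the resize arithmetic I use to pick candidate sizes (the helper names are mine, not the repo's):

```python
def nearest_multiple(x: int, base: int = 14) -> int:
    """Round x to the nearest positive multiple of base (ViT patch size)."""
    return max(base, round(x / base) * base)

def target_shape(h: int, w: int, target_w: int = 518, patch: int = 14):
    """Scale the width to target_w, keep the aspect ratio, and snap both
    sides to multiples of the patch size. Illustrative helper only."""
    scale = target_w / w
    return nearest_multiple(round(h * scale), patch), nearest_multiple(target_w, patch)

print(target_shape(480, 640))        # 480x640 input at the default width
print(target_shape(480, 640, 392))   # a smaller candidate between 224 and 518
```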
The original images are 480×640. I resized them to 512 before feeding them into the network. The script I used is the same as in https://github.com/facebookresearch/vggt/issues/21.
Hi @betray12138 , have you solved this problem? I get the same results for the camera extrinsics.