Questions about Latency Measurement
Thank you for sharing this impressive work! I’m particularly interested in the claim: "Optimized for speed: Achieves 60ms latency per image (A100 or RTX3090, FP16, ViT-L)." Could you please clarify the following details?
Latency Measurement Method: How did you measure the 60ms latency per image? For example, did you use Python’s time.time() for timing, or did you rely on tools like torch.profiler or CUDA events (torch.cuda.Event) for more precise GPU timing?
Input Image Size: What was the input image resolution used to achieve the 60ms latency result? For instance, was it a standard size like 224x224, 384x384, or something else?
These details would help me better understand the performance characteristics of your implementation. Thanks in advance for your time and insights!
Hi, thanks for raising these valuable questions!
- Measurement Method: We measure the runtime of the .infer() function using torch.cuda.synchronize(); time.time(). This timing includes the entire inference pipeline: image preprocessing, inference, and postprocessing (which includes solving for focal length and shift). As a result, the reported latency reflects both GPU and CPU time.
- Input Image Size: MoGe supports flexible image resolutions and aspect ratios. The reported 60ms latency corresponds to the maximum trained resolution, which is also the default resolution_level=9, yielding 3600 ViT tokens. This translates to an effective image area of 3600 × 14 × 14 pixels. For common aspect ratios, the actual image sizes are (see the sketch after this list):
  - 2:1 → 1176 × 588
  - 1:1 → 840 × 840
  - 1:2 → 588 × 1176
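For reference, here is a rough sketch of how those sizes follow from the 3600-token budget, assuming the height and width are snapped to multiples of the 14-pixel ViT patch size. This is only an illustration for intuition, not MoGe's actual resizing code:

import math

PATCH = 14          # DINOv2 ViT patch size
MAX_TOKENS = 3600   # default resolution_level=9

def size_for_aspect(aspect_w: float, aspect_h: float) -> tuple:
    """Largest (width, height) with the given aspect ratio whose 14x14 patch
    grid stays within MAX_TOKENS (both sides are multiples of PATCH)."""
    target_area = MAX_TOKENS * PATCH * PATCH                 # 705600 pixels at level 9
    ideal_h = math.sqrt(target_area * aspect_h / aspect_w)   # ideal height before snapping
    h_tokens = int(ideal_h // PATCH)                         # snap down to the patch grid
    w_tokens = int(h_tokens * aspect_w / aspect_h)
    return w_tokens * PATCH, h_tokens * PATCH

for w, h in [(2, 1), (1, 1), (1, 2)]:
    print(f"{w}:{h} ->", size_for_aspect(w, h))   # (1176, 588), (840, 840), (588, 1176)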
More details on the runtime will be included in the upcoming MoGe-2 paper.
Thanks for the quick and clear response!
Hi ruicheng! I tested the latency on an RTX 3090 following the measurement method and input image size you described, and the result is quite different from "Optimized for speed: Achieves 60ms latency per image (A100 or RTX3090, FP16, ViT-L)".
The code I used to measure latency is as follows:
import time

import torch

# from moge.model.v1 import MoGeModel  # Let's try MoGe-1
from moge.model.v2 import MoGeModel  # Let's try MoGe-2


def main():
    device = torch.device("cuda")

    # Load the model from the Hugging Face hub (or from a local checkpoint).
    # model = MoGeModel.from_pretrained("Ruicheng/moge-vitl").to(device)
    model = MoGeModel.from_pretrained("Ruicheng/moge-2-vitl").to(device)
    # model = MoGeModel.from_pretrained("Ruicheng/moge-2-vitl-normal").to(device)
    # model = MoGeModel.from_pretrained("Ruicheng/moge-2-vitb-normal").to(device)
    # model = MoGeModel.from_pretrained("Ruicheng/moge-2-vits-normal").to(device)

    # Print the number of model parameters in millions.
    total_params = sum(p.numel() for p in model.parameters())
    print(f"model parameters number: {total_params / 1e6:.2f} M")

    # ###########################################################
    # Test end-to-end latency using time.time()
    # ###########################################################
    # Create a dummy input.
    batch_size = 1
    H, W = 840, 840
    dummy_input = torch.randn(batch_size, 3, H, W, device=device)

    print("\n" + "=" * 50)
    print("start to test end-to-end latency using time.time()...")

    # -- Warm-up --
    print("warming up GPU...")
    for _ in range(10):
        _ = model.infer(dummy_input)
    torch.cuda.synchronize()

    # -- Measure latency --
    print("start to measure latency...")
    num_runs = 20
    total_time = 0.0
    for _ in range(num_runs):
        torch.cuda.synchronize()
        start_time = time.time()
        # NOTE: the README's 60ms figure is reported for FP16; this run uses FP32 (use_fp16=False).
        _ = model.infer(dummy_input, resolution_level=9, use_fp16=False)
        torch.cuda.synchronize()
        end_time = time.time()
        total_time += end_time - start_time

    average_latency = (total_time / num_runs) * 1000   # ms
    throughput = batch_size / (total_time / num_runs)  # images/sec

    print("-" * 50)
    print(f"test config: Batch Size = {batch_size}, Input Size = {H}x{W}")
    print(f"number of runs: {num_runs} times")
    print(f"average latency: {average_latency:.3f} ms")
    print(f"throughput: {throughput:.2f} image/s")
    print("=" * 50)


if __name__ == "__main__":
    main()
Do you think this is reasonable, or might there be a bug in my code? Additionally, could you please also specify how you measure peak GPU memory? Many thanks!
Hi, I just noticed that the current code falls back to standard matrix-multiplication attention when xformers is not installed. This was caused by an inconsistency in a historical commit: the model code was updated, but the DINOv2 dependency remained outdated. This has now been fixed by overriding the default DINOv2 attention implementation with PyTorch SDPA, which offers performance comparable to xformers. Apologies for the oversight.
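For context, "PyTorch SDPA" here refers to torch.nn.functional.scaled_dot_product_attention, which dispatches to fused (flash / memory-efficient) kernels when available. Below is a simplified sketch of a ViT-style attention block built on it, just to illustrate the idea; it is not the actual override patch in the repo:

import torch
import torch.nn.functional as F
from torch import nn

class SDPASelfAttention(nn.Module):
    # Simplified ViT-style self-attention using PyTorch's fused SDPA kernel.
    # Illustration only; the real fix patches DINOv2's attention module.
    def __init__(self, dim: int, num_heads: int = 16):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        # (B, N, 3*C) -> 3 x (B, num_heads, N, head_dim)
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv.unbind(0)
        # Fused attention; performance is comparable to xformers' memory-efficient attention.
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)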
Also, for a more accurate latency evaluation, I suggest testing with a real image input, as some post-processing steps are input-dependent.
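For example, a minimal sketch of such a test (the image path is a placeholder; peak GPU memory here is read from the PyTorch caching allocator via torch.cuda.max_memory_allocated, which is one common way to measure it):

import time

import cv2
import torch
from moge.model.v2 import MoGeModel

device = torch.device("cuda")
model = MoGeModel.from_pretrained("Ruicheng/moge-2-vitl").to(device)

# Load a real image (placeholder path) as a (3, H, W) float tensor in [0, 1].
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
image = torch.tensor(image / 255, dtype=torch.float32, device=device).permute(2, 0, 1)

# Warm-up.
for _ in range(10):
    model.infer(image, use_fp16=True)
torch.cuda.synchronize()

# Latency with FP16, matching the setting quoted from the README.
torch.cuda.reset_peak_memory_stats(device)
num_runs = 20
start = time.time()
for _ in range(num_runs):
    model.infer(image, use_fp16=True)
torch.cuda.synchronize()
print(f"average latency: {(time.time() - start) / num_runs * 1000:.1f} ms")

# Peak GPU memory as tracked by the PyTorch allocator.
print(f"peak GPU memory: {torch.cuda.max_memory_allocated(device) / 1024**2:.1f} MiB")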
Let me know if there is still any issue with reproducing the latency!