Questions about Latency Measurement
Thank you for sharing this impressive work! I’m particularly interested in the claim: "Optimized for speed: Achieves 60ms latency per image (A100 or RTX3090, FP16, ViT-L)." Could you please clarify the following details?
Latency Measurement Method: How did you measure the 60ms latency per image? For example, did you use Python’s time.time() for timing, or did you rely on tools like torch.profiler or CUDA events (torch.cuda.Event) for more precise GPU timing?
Input Image Size: What was the input image resolution used to achieve the 60ms latency result? For instance, was it a standard size like 224x224, 384x384, or something else?
These details would help me better understand the performance characteristics of your implementation. Thanks in advance for your time and insights!
Hi, thanks for raising these valuable questions!
- Measurement Method: We measure the runtime of the .infer() function using torch.cuda.synchronize(); time.time(). This timing includes the entire inference pipeline: image preprocessing, inference, and postprocessing (which includes solving for focal length and shift). As a result, the reported latency reflects both GPU and CPU time.
- Input Image Size: MoGe supports flexible image resolutions and aspect ratios. The reported 60ms latency corresponds to the maximum trained resolution, which is also the default resolution_level=9, yielding 3600 ViT tokens. This translates to an effective image area of 3600 × 14 × 14 pixels. For common aspect ratios, the actual image sizes are (see the sketch after this list):
  - 2:1 → 1176 × 588
  - 1:1 → 840 × 840
  - 1:2 → 588 × 1176
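For reference, here is a rough sketch of how those sizes follow from the 3600-token budget, assuming the height and width are snapped to multiples of the 14-pixel ViT patch size. This is only an illustration for intuition, not MoGe's actual resizing code:

import math

PATCH = 14          # DINOv2 ViT patch size
MAX_TOKENS = 3600   # default resolution_level=9

def size_for_aspect(aspect_w: float, aspect_h: float) -> tuple:
    """Largest (width, height) with the given aspect ratio whose 14x14 patch
    grid stays within MAX_TOKENS (both sides are multiples of PATCH)."""
    target_area = MAX_TOKENS * PATCH * PATCH                 # 705600 pixels at level 9
    ideal_h = math.sqrt(target_area * aspect_h / aspect_w)   # ideal height before snapping
    h_tokens = int(ideal_h // PATCH)                         # snap down to the patch grid
    w_tokens = int(h_tokens * aspect_w / aspect_h)
    return w_tokens * PATCH, h_tokens * PATCH

for w, h in [(2, 1), (1, 1), (1, 2)]:
    print(f"{w}:{h} ->", size_for_aspect(w, h))   # (1176, 588), (840, 840), (588, 1176)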
More details on the runtime will be included in the upcoming MoGe-2 paper.
Thanks for the quick and clear response!
Hi ruicheng! I tested the latency on an RTX 3090 following the measurement method and input image size you described, and the result is quite different from "Optimized for speed: Achieves 60ms latency per image (A100 or RTX3090, FP16, ViT-L)".
The code I used to measure latency is as follows:
import time

import torch

# from moge.model.v1 import MoGeModel  # Let's try MoGe-1
from moge.model.v2 import MoGeModel  # Let's try MoGe-2


def main():
    device = torch.device("cuda")

    # Load the model from the Hugging Face hub (or from a local checkpoint).
    # model = MoGeModel.from_pretrained("Ruicheng/moge-vitl").to(device)
    model = MoGeModel.from_pretrained("Ruicheng/moge-2-vitl").to(device)
    # model = MoGeModel.from_pretrained("Ruicheng/moge-2-vitl-normal").to(device)
    # model = MoGeModel.from_pretrained("Ruicheng/moge-2-vitb-normal").to(device)
    # model = MoGeModel.from_pretrained("Ruicheng/moge-2-vits-normal").to(device)

    # Print the number of model parameters in millions.
    total_params = sum(p.numel() for p in model.parameters())
    print(f"model parameters number: {total_params / 1e6:.2f} M")

    # ###########################################################
    # Test end-to-end latency using time.time()
    # ###########################################################
    # Create a dummy input.
    batch_size = 1
    H, W = 840, 840
    dummy_input = torch.randn(batch_size, 3, H, W, device=device)

    print("\n" + "=" * 50)
    print("start to test end-to-end latency using time.time()...")

    # -- Warm-up --
    print("warming up GPU...")
    for _ in range(10):
        _ = model.infer(dummy_input)
    torch.cuda.synchronize()

    # -- Measure latency --
    print("start to measure latency...")
    num_runs = 20
    total_time = 0.0
    for _ in range(num_runs):
        torch.cuda.synchronize()
        start_time = time.time()
        # NOTE: the README's 60ms figure is reported for FP16; this run uses FP32 (use_fp16=False).
        _ = model.infer(dummy_input, resolution_level=9, use_fp16=False)
        torch.cuda.synchronize()
        end_time = time.time()
        total_time += end_time - start_time

    average_latency = (total_time / num_runs) * 1000   # ms
    throughput = batch_size / (total_time / num_runs)  # images/sec

    print("-" * 50)
    print(f"test config: Batch Size = {batch_size}, Input Size = {H}x{W}")
    print(f"number of runs: {num_runs} times")
    print(f"average latency: {average_latency:.3f} ms")
    print(f"throughput: {throughput:.2f} image/s")
    print("=" * 50)


if __name__ == "__main__":
    main()
Do you think this is reasonable, or might there be a bug in my code? Additionally, could you please also specify how you measure peak GPU memory? Many thanks!
Hi, I just noticed that the current code falls back to standard matrix-multiplication attention when xformers is not installed. This was caused by an inconsistency in a historical commit: the model code was updated, but the DINOv2 dependency remained outdated. This has now been fixed by overriding the default DINOv2 attention implementation with PyTorch SDPA, which offers performance comparable to xformers. Apologies for the oversight.
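For context, "PyTorch SDPA" here refers to torch.nn.functional.scaled_dot_product_attention, which dispatches to fused (flash / memory-efficient) kernels when available. Below is a simplified sketch of a ViT-style attention block built on it, just to illustrate the idea; it is not the actual override patch in the repo:

import torch
import torch.nn.functional as F
from torch import nn

class SDPASelfAttention(nn.Module):
    # Simplified ViT-style self-attention using PyTorch's fused SDPA kernel.
    # Illustration only; the real fix patches DINOv2's attention module.
    def __init__(self, dim: int, num_heads: int = 16):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        # (B, N, 3*C) -> 3 x (B, num_heads, N, head_dim)
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv.unbind(0)
        # Fused attention; performance is comparable to xformers' memory-efficient attention.
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)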
Also, for a more accurate latency evaluation, I suggest testing with a real image input, as some post-processing steps are input-dependent.
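For example, a minimal sketch of such a test (the image path is a placeholder; peak GPU memory here is read from the PyTorch caching allocator via torch.cuda.max_memory_allocated, which is one common way to measure it):

import time

import cv2
import torch
from moge.model.v2 import MoGeModel

device = torch.device("cuda")
model = MoGeModel.from_pretrained("Ruicheng/moge-2-vitl").to(device)

# Load a real image (placeholder path) as a (3, H, W) float tensor in [0, 1].
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
image = torch.tensor(image / 255, dtype=torch.float32, device=device).permute(2, 0, 1)

# Warm-up.
for _ in range(10):
    model.infer(image, use_fp16=True)
torch.cuda.synchronize()

# Latency with FP16, matching the setting quoted from the README.
torch.cuda.reset_peak_memory_stats(device)
num_runs = 20
start = time.time()
for _ in range(num_runs):
    model.infer(image, use_fp16=True)
torch.cuda.synchronize()
print(f"average latency: {(time.time() - start) / num_runs * 1000:.1f} ms")

# Peak GPU memory as tracked by the PyTorch allocator.
print(f"peak GPU memory: {torch.cuda.max_memory_allocated(device) / 1024**2:.1f} MiB")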
Let me know if there is still any issue with reproducing the latency!