
Performance Gap Between Self-Hosted VILA Model and NVIDIA VILA API - Need Parameter Configuration Guidance

Open cravirajan opened this issue 5 months ago • 5 comments

Issue Category: Model Performance & Configuration

Detailed Description:

Current Setup

  • Infrastructure: GPU-supported EC2 instances
  • Implementation: FastAPI wrapper on top of VILA inference command
  • Problem: Significant performance gap compared to NVIDIA VILA API responses
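
For concreteness, a wrapper like the one described above might shell out to the VILA inference command roughly as follows. This is a hypothetical sketch: the script name `vila_infer.py` and its flags are placeholders for whatever entry point the actual deployment uses, not the real VILA CLI.

```python
import subprocess

def build_vila_command(prompt: str, image_path: str) -> list[str]:
    """Assemble the CLI invocation the wrapper shells out to.

    The script name and flag names are placeholders; substitute the
    actual VILA inference entry point used in your deployment.
    """
    return [
        "python", "vila_infer.py",   # hypothetical entry point
        "--image", image_path,
        "--query", prompt,
    ]

def run_inference(prompt: str, image_path: str) -> str:
    """Run the command and return its stdout as the model response."""
    result = subprocess.run(
        build_vila_command(prompt, image_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

A wrapper built this way reloads model weights on every request, which alone can explain part of a latency gap versus a persistent serving stack.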

Specific Issues

  1. Performance Discrepancy:

    • Self-deployed VILA models produce noticeably worse results than the NVIDIA VILA API
    • Suspect the NVIDIA API may be using larger or more extensively trained models (potentially >40B parameters)
    • Note: As of January 6, 2025, VILA is part of the new Cosmos Nemotron vision language model family
  2. Missing Parameter Configuration:

    • Unable to configure inference parameters in current FastAPI implementation
    • Need to pass: temperature, top_p, seed values to deployed model
    • Current setup doesn't support these sampling parameters
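
To make the missing knobs concrete, the request schema and its translation into generation arguments could look like the sketch below. The class and function names are illustrative, and the keyword names assume a Hugging Face-style `generate()` interface, which is an assumption rather than confirmed VILA internals.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SamplingParams:
    """Sampling knobs a request body would carry (illustrative schema)."""
    temperature: float = 0.2
    top_p: float = 0.9
    seed: Optional[int] = None  # applied separately, e.g. via torch.manual_seed

    def validate(self) -> None:
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError("temperature must be in [0, 2]")
        if not 0.0 < self.top_p <= 1.0:
            raise ValueError("top_p must be in (0, 1]")

def to_generation_kwargs(params: SamplingParams) -> dict:
    """Map request params onto an assumed HF-style generate() call."""
    params.validate()
    return {
        "do_sample": params.temperature > 0,  # temperature 0 -> greedy decoding
        "temperature": params.temperature,
        "top_p": params.top_p,
    }
```

Validating bounds at the API boundary keeps malformed sampling values from silently producing degenerate outputs.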

Questions for Support Team

  1. Model Specifications:

    • What are the exact model parameters/versions used in NVIDIA VILA API?
    • Are there larger parameter models (>40B) available that aren't in public repositories?
    • Will results differ if the self-hosted API is built directly on EC2, without using NVIDIA's Metropolis microservices?
  2. Parameter Configuration:

    • How to properly implement temperature, top_p, and seed parameters in VILA inference?
    • Best practices for FastAPI wrapper configuration with these parameters

Expected Resolution

  • Clear documentation on implementing sampling parameters (temperature, top_p, seed)
  • Guidance on model selection to match NVIDIA API performance
  • Best practices for FastAPI deployment with proper parameter support

cravirajan · Jun 17 '25 10:06