
Performance Gap Between Self-Hosted VILA Model and NVIDIA VILA API - Need Parameter Configuration Guidance

Open cravirajan opened this issue 5 months ago • 5 comments

Issue Category: Model Performance & Configuration

Detailed Description:

Current Setup

  • Infrastructure: GPU-supported EC2 instances
  • Implementation: FastAPI wrapper on top of VILA inference command
  • Problem: Significant performance gap compared to NVIDIA VILA API responses
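
For concreteness, a wrapper like the one described above might shell out to the VILA inference command roughly as follows. This is a hypothetical sketch: the script name `vila_infer.py` and its flags are placeholders for whatever entry point the actual deployment uses, not the real VILA CLI.

```python
import subprocess

def build_vila_command(prompt: str, image_path: str) -> list[str]:
    """Assemble the CLI invocation the wrapper shells out to.

    The script name and flag names are placeholders; substitute the
    actual VILA inference entry point used in your deployment.
    """
    return [
        "python", "vila_infer.py",   # hypothetical entry point
        "--image", image_path,
        "--query", prompt,
    ]

def run_inference(prompt: str, image_path: str) -> str:
    """Run the command and return its stdout as the model response."""
    result = subprocess.run(
        build_vila_command(prompt, image_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

A wrapper built this way reloads model weights on every request, which alone can explain part of a latency gap versus a persistent serving stack.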

Specific Issues

  1. Performance Discrepancy:

    • Self-deployed VILA models produce noticeably worse results than the NVIDIA VILA API
    • Suspect the NVIDIA API may be using larger or more extensively trained models (potentially >40B parameters)
    • Note: As of January 6, 2025, VILA is part of the new Cosmos Nemotron vision language model family
  2. Missing Parameter Configuration:

    • Unable to configure inference parameters in current FastAPI implementation
    • Need to pass: temperature, top_p, seed values to deployed model
    • Current setup doesn't support these sampling parameters
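
To make the missing knobs concrete, the request schema and its translation into generation arguments could look like the sketch below. The class and function names are illustrative, and the keyword names assume a Hugging Face-style `generate()` interface, which is an assumption rather than confirmed VILA internals.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SamplingParams:
    """Sampling knobs a request body would carry (illustrative schema)."""
    temperature: float = 0.2
    top_p: float = 0.9
    seed: Optional[int] = None  # applied separately, e.g. via torch.manual_seed

    def validate(self) -> None:
        if not 0.0 <= self.temperature <= 2.0:
            raise ValueError("temperature must be in [0, 2]")
        if not 0.0 < self.top_p <= 1.0:
            raise ValueError("top_p must be in (0, 1]")

def to_generation_kwargs(params: SamplingParams) -> dict:
    """Map request params onto an assumed HF-style generate() call."""
    params.validate()
    return {
        "do_sample": params.temperature > 0,  # temperature 0 -> greedy decoding
        "temperature": params.temperature,
        "top_p": params.top_p,
    }
```

Validating bounds at the API boundary keeps malformed sampling values from silently producing degenerate outputs.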

Questions for Support Team

  1. Model Specifications:

    • What are the exact model parameters/versions used in NVIDIA VILA API?
    • Are there larger parameter models (>40B) available that aren't in public repositories?
    • Will results differ if the self-hosted API is built directly on EC2, without using NVIDIA's Metropolis microservices?
  2. Parameter Configuration:

    • How to properly implement temperature, top_p, and seed parameters in VILA inference?
    • Best practices for FastAPI wrapper configuration with these parameters

Expected Resolution

  • Clear documentation on implementing sampling parameters (temperature, top_p, seed)
  • Guidance on model selection to match NVIDIA API performance
  • Best practices for FastAPI deployment with proper parameter support

cravirajan · Jun 17 '25 10:06