
VLM Test Results Significantly Lower Than Paper-Reported Performance

Open tianxiang0521 opened this issue 7 months ago • 0 comments


Summary

Our experimental results show a significant performance gap relative to the metrics reported in the "Explore until Confident" paper: our tests achieve a ~49% success rate, while the paper reports a 58.4% baseline with the Prismatic VLM.

Experimental Setup

  • Test Cases: 500 scenarios from HM-EQA dataset
  • Model Configuration: prism-dinosiglip+7b with the dinosiglip-vit-so-384px vision backbone and the llama2-7b-pure language model (loading sketch below)
  • Runtime: 42 hours 26 minutes total (~305 seconds per test case)
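
For context, this is roughly how we load and query the VLM in our runs. It is a minimal sketch based on the public prismatic-vlms loading interface as we understand it; the image path, question text, token handling, and generation parameters are placeholders for illustration, not the exact settings used by explore-eqa.

```python
import torch
from PIL import Image

# Assumes the public `prismatic-vlms` package; the exact API may differ from
# the version pinned by explore-eqa.
from prismatic import load

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Model ID matching our setup: dinosiglip-vit-so-384px backbone + llama2-7b-pure LM.
# An HF token may be needed to download the gated LLaMA-2 weights.
vlm = load("prism-dinosiglip+7b")
vlm.to(device, dtype=torch.bfloat16)

# Hypothetical single-image query, analogous to one exploration step.
image = Image.open("frontier_view.png").convert("RGB")  # placeholder image
prompt_builder = vlm.get_prompt_builder()
prompt_builder.add_turn(role="human", message="Is there a TV in the living room? Answer yes or no.")

answer = vlm.generate(
    image,
    prompt_builder.get_prompt(),
    do_sample=False,
    max_new_tokens=16,
)
print(answer)
```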

Results Comparison

| Metric | Our Test Results | Paper Reported |
| --- | --- | --- |
| Weighted Success Rate | 243/500 (48.6%) | 58.4% (base) |
| Max Success Rate | 245/500 (49.0%) | ~60% (max time steps) |
| With Fine-tuning | Not tested | 68.1% |
| With GPT4-V | Not tested | 73.9% |

Performance Gap

  • 9.4–9.8 percentage points below the paper's baseline Prismatic VLM performance (49.0% max / 48.6% weighted vs. 58.4%; see the quick check below)
  • Significantly below the reported "around 60% with maximum time steps"
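
For transparency, the success rates and gaps above follow directly from the raw episode counts; a minimal sanity check (counts taken from our run logs):

```python
# Raw episode counts from our 500-case run (see the table above).
weighted_successes, max_successes, episodes = 243, 245, 500
paper_baseline = 58.4  # Prismatic VLM baseline reported in the paper (%)

weighted_sr = 100 * weighted_successes / episodes  # 48.6%
max_sr = 100 * max_successes / episodes            # 49.0%

print(f"weighted SR: {weighted_sr:.1f}%, gap: {paper_baseline - weighted_sr:.1f} pts")  # 9.8 pts
print(f"max SR:      {max_sr:.1f}%, gap: {paper_baseline - max_sr:.1f} pts")            # 9.4 pts
```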

Possible Causes

  1. Model Configuration Differences:

    • Our test used prism-dinosiglip+7b; it is unclear whether this matches the paper's Prismatic variant
    • Different vision backbone or language model versions
  2. Fine-tuning Status:

    • Paper mentions 68.1% after fine-tuning (improved from 56.2%)
    • Our test may be using base/non-fine-tuned weights (a quick checkpoint-identity check is sketched after this list)
  3. Evaluation Protocol:

    • Different stopping criteria implementation
    • Different time step normalization
    • Possible differences in semantic exploration parameters
  4. Dataset Version:

    • Potential differences in HM-EQA dataset version or preprocessing
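
To help narrow down causes 1 and 2, we can compare exactly which weights are loaded on each side. The snippet below is a generic sketch (the checkpoint path is a placeholder for our local file, not the repository's actual layout); running the same check on the authors' checkpoint would immediately confirm or rule out a weights mismatch.

```python
import hashlib
import json
from pathlib import Path

# Placeholder path: point this at the checkpoint actually loaded in the run.
ckpt_path = Path("checkpoints/prism-dinosiglip+7b/checkpoint.pt")

# Stream the file through SHA-256 so a multi-GB checkpoint never sits in memory.
digest = hashlib.sha256()
with ckpt_path.open("rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        digest.update(chunk)
print(f"{ckpt_path.name}: sha256={digest.hexdigest()}, size={ckpt_path.stat().st_size} bytes")

# If a config.json sits next to the checkpoint, it typically records the vision
# backbone and language-model identifiers, which bears on cause 1.
cfg_file = ckpt_path.parent / "config.json"
if cfg_file.exists():
    print(json.dumps(json.loads(cfg_file.read_text()), indent=2))
```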

Questions for Reproduction

  1. Model Specification: What exact Prismatic VLM configuration was used in the paper experiments?
  2. Fine-tuning: Were the reported 58.4% results from fine-tuned or base models?
  3. Hyperparameters: What were the specific values for the following (a placeholder config is sketched after this list):
    • Temperature scaling (τ_LSV, τ_GSV)
    • Semantic value weights
    • Stopping criteria thresholds
  4. Evaluation Setup:
    • Exact time step normalization method
    • Frontier sampling implementation details
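
To make question 3 concrete, the structure below lists the settings we would like confirmed. Every key name and value is a placeholder mirroring the paper's notation, not the repository's actual configuration schema.

```python
# Hypothetical structure for illustration only; these are NOT the real
# explore-eqa config keys, just the quantities we are asking about.
requested_hparams = {
    "tau_lsv": None,            # temperature applied to Local Semantic Value scores
    "tau_gsv": None,            # temperature applied to Global Semantic Value scores
    "semantic_value_weights": {
        "lsv": None,            # weight on the local semantic value when ranking frontiers
        "gsv": None,            # weight on the global semantic value
    },
    "stopping": {
        "confidence_threshold": None,  # threshold used by the stopping criterion
        "max_time_steps": None,        # budget used for time-step normalization
    },
}
```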

Request

Could you please provide:

  1. The exact model configuration and checkpoint used for the 58.4% baseline results
  2. Training/fine-tuning details and data splits
  3. Complete hyperparameter settings
  4. Any preprocessing steps or evaluation protocol details that might affect results

This would help ensure proper reproduction of the paper's results and identify the source of the performance discrepancy.

Additional Notes

The paper notes that performance scales with VLM capabilities, mentioning improvements with LLaVA 1.6 (65.3%) and GPT4-V (73.9%). However, even accounting for model differences, the gap between our baseline results and the paper's baseline suggests systematic differences in the experimental setup.
