Issue: VLM Test Results Significantly Lower Than Paper-Reported Performance
Summary
Our experiments show a significant performance gap relative to the metrics reported in the "Explore until Confident" paper: our tests achieve a ~49% success rate, while the paper reports a 58.4% baseline with the Prismatic VLM.
Experimental Setup
- Test Cases: 500 scenarios from HM-EQA dataset
- Model Configuration: `prism-dinosiglip+7b` with `dinosiglip-vit-so-384px` vision backbone and `llama2-7b-pure` language model
- Runtime: 42 hours 26 minutes total (~305 seconds per test case)
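For reference, this is roughly how we instantiate the model in our runs; a minimal sketch assuming the public `prismatic-vlms` loading API (`from prismatic import load`), with the token file path as a placeholder:

```python
# Minimal sketch of our model setup (assumes the public prismatic-vlms API;
# adjust if the paper used a different entry point or checkpoint).
from pathlib import Path

import torch
from prismatic import load  # TRI-ML/prismatic-vlms

hf_token = Path(".hf_token").read_text().strip()  # gated Llama-2 backbone requires a token

model_id = "prism-dinosiglip+7b"  # dinosiglip-vit-so-384px vision + llama2-7b-pure LM
vlm = load(model_id, hf_token=hf_token)
vlm.to(torch.device("cuda"), dtype=torch.bfloat16)
```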
Results Comparison
| Metric | Our Test Results | Paper Reported |
|---|---|---|
| Weighted Success Rate | 243/500 (48.6%) | 58.4% (base) |
| Max Success Rate | 245/500 (49.0%) | ~60% (max time steps) |
| With Fine-tuning | Not tested | 68.1% |
| With GPT4-V | Not tested | 73.9% |
Performance Gap
- Our max success rate (49.0%) is ~9.4 percentage points below the paper's baseline Prismatic VLM result (58.4%); the weighted rate (48.6%) is ~9.8 points below
- Both are significantly below the reported "around 60% with maximum time steps"
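For transparency, the headline numbers above follow directly from raw counts; a quick check of the arithmetic (all values taken from the table and setup above):

```python
# Recompute the reported rates and gaps from raw counts (values from the table above).
weighted_successes, max_successes, total = 243, 245, 500

weighted_rate = weighted_successes / total   # 0.486 -> 48.6%
max_rate = max_successes / total             # 0.490 -> 49.0%
paper_baseline = 0.584                       # 58.4% reported for the Prismatic VLM baseline

gap_weighted = (paper_baseline - weighted_rate) * 100  # ~9.8 percentage points
gap_max = (paper_baseline - max_rate) * 100            # ~9.4 percentage points

total_runtime_s = 42 * 3600 + 26 * 60
per_case_s = total_runtime_s / total         # ~305 s per test case

print(f"weighted: {weighted_rate:.1%}, max: {max_rate:.1%}, "
      f"gaps: {gap_weighted:.1f} / {gap_max:.1f} pp, {per_case_s:.0f} s/case")
```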
Possible Causes
- Model Configuration Differences:
  - Test used `prism-dinosiglip+7b`; unclear if this matches the paper's Prismatic variant
  - Different vision backbone or language model versions
- Fine-tuning Status:
  - Paper mentions 68.1% after fine-tuning (improved from 56.2%)
  - Our test may be using the base/non-fine-tuned model
- Evaluation Protocol:
  - Different stopping criteria implementation
  - Different time step normalization
  - Possible differences in semantic exploration parameters
- Dataset Version:
  - Potential differences in HM-EQA dataset version or preprocessing
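To help rule out the first cause, the following is the kind of helper (hypothetical, not from the paper's codebase) we would use to record the exact checkpoint and run configuration we evaluated, so both sides can diff their setups:

```python
# Hypothetical reproducibility helper (not from the paper's codebase): record the
# exact checkpoint file and run configuration so setups can be compared directly.
import hashlib
import json
from pathlib import Path


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dump_run_manifest(checkpoint_path: str, config: dict, out_path: str = "run_manifest.json") -> None:
    """Write the checkpoint hash and config to a JSON manifest that can be attached to this issue."""
    manifest = {
        "checkpoint": checkpoint_path,
        "checkpoint_sha256": sha256_of(checkpoint_path),
        "config": config,
    }
    Path(out_path).write_text(json.dumps(manifest, indent=2))


# Example usage with our run's settings (paths are placeholders):
# dump_run_manifest(
#     "checkpoints/prism-dinosiglip+7b.pt",
#     {"model_id": "prism-dinosiglip+7b", "dataset": "HM-EQA", "num_scenarios": 500},
# )
```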
Questions for Reproduction
- Model Specification: What exact Prismatic VLM configuration was used in the paper experiments?
- Fine-tuning: Were the reported 58.4% results from fine-tuned or base models?
- Hyperparameters: What were the specific values for the following (see the config sketch after this list)?
  - Temperature scaling (τ_LSV, τ_GSV)
  - Semantic value weights
  - Stopping criteria thresholds
- Evaluation Setup:
  - Exact time step normalization method
  - Frontier sampling implementation details
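To make the hyperparameter question concrete, below is the kind of configuration block we are currently guessing at. Every field name and value here is a placeholder from our own runs, not a number or identifier taken from the paper; confirming or correcting each one is exactly what we are asking for:

```python
# Placeholder configuration used in our runs (all values are our guesses, NOT the
# paper's settings); we would appreciate the paper's actual values for each field.
from dataclasses import dataclass


@dataclass
class ExplorationConfig:
    tau_lsv: float = 1.0                 # temperature for local semantic value (τ_LSV)
    tau_gsv: float = 1.0                 # temperature for global semantic value (τ_GSV)
    semantic_value_weight: float = 0.5   # weight mixing semantic value into frontier scores
    confidence_threshold: float = 0.5    # stopping criterion on the VLM's answer confidence
    max_time_steps: int = 50             # budget behind the "max time steps" numbers
    num_frontier_samples: int = 8        # frontiers sampled per exploration step


config = ExplorationConfig()
```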
Request
Could you please provide:
- The exact model configuration and checkpoint used for the 58.4% baseline results
- Training/fine-tuning details and data splits
- Complete hyperparameter settings
- Any preprocessing steps or evaluation protocol details that might affect results
This would help ensure proper reproduction of the paper's results and identify the source of the performance discrepancy.
Additional Notes
The paper notes that performance scales with VLM capabilities, mentioning improvements with LLaVA 1.6 (65.3%) and GPT4-V (73.9%). However, even accounting for model differences, the gap between our base results and the paper's base results suggests systematic differences in the experimental setup.