Issue: VLM Test Results Significantly Lower Than Paper-Reported Performance
Summary
Our experiments show a significant performance gap relative to the metrics reported in the "Explore until Confident" paper: our tests achieve a ~49% success rate, while the paper reports a 58.4% baseline with the Prismatic VLM.
Experimental Setup
- Test Cases: 500 scenarios from HM-EQA dataset
- Model Configuration: `prism-dinosiglip+7b` with `dinosiglip-vit-so-384px` vision backbone and `llama2-7b-pure` language model
- Runtime: 42 hours 26 minutes total (~305 seconds per test case)
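For reference, this is roughly how we instantiate the model in our runs; a minimal sketch assuming the public `prismatic-vlms` loading API (`from prismatic import load`), with the token file path as a placeholder:

```python
# Minimal sketch of our model setup (assumes the public prismatic-vlms API;
# adjust if the paper used a different entry point or checkpoint).
from pathlib import Path

import torch
from prismatic import load  # TRI-ML/prismatic-vlms

hf_token = Path(".hf_token").read_text().strip()  # gated Llama-2 backbone requires a token

model_id = "prism-dinosiglip+7b"  # dinosiglip-vit-so-384px vision + llama2-7b-pure LM
vlm = load(model_id, hf_token=hf_token)
vlm.to(torch.device("cuda"), dtype=torch.bfloat16)
```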
Results Comparison
| Metric | Our Test Results | Paper Reported |
|---|---|---|
| Weighted Success Rate | 243/500 (48.6%) | 58.4% (base) |
| Max Success Rate | 245/500 (49.0%) | ~60% (max time steps) |
| With Fine-tuning | Not tested | 68.1% |
| With GPT4-V | Not tested | 73.9% |
Performance Gap
- Our max success rate (49.0%) is ~9.4 percentage points below the paper's baseline Prismatic VLM result (58.4%); the weighted rate (48.6%) is ~9.8 points below
- Both are significantly below the reported "around 60% with maximum time steps"
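For transparency, the headline numbers above follow directly from raw counts; a quick check of the arithmetic (all values taken from the table and setup above):

```python
# Recompute the reported rates and gaps from raw counts (values from the table above).
weighted_successes, max_successes, total = 243, 245, 500

weighted_rate = weighted_successes / total   # 0.486 -> 48.6%
max_rate = max_successes / total             # 0.490 -> 49.0%
paper_baseline = 0.584                       # 58.4% reported for the Prismatic VLM baseline

gap_weighted = (paper_baseline - weighted_rate) * 100  # ~9.8 percentage points
gap_max = (paper_baseline - max_rate) * 100            # ~9.4 percentage points

total_runtime_s = 42 * 3600 + 26 * 60
per_case_s = total_runtime_s / total         # ~305 s per test case

print(f"weighted: {weighted_rate:.1%}, max: {max_rate:.1%}, "
      f"gaps: {gap_weighted:.1f} / {gap_max:.1f} pp, {per_case_s:.0f} s/case")
```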
Possible Causes
- Model Configuration Differences:
  - Test used `prism-dinosiglip+7b`; unclear if this matches the paper's Prismatic variant
  - Different vision backbone or language model versions
- Fine-tuning Status:
  - Paper mentions 68.1% after fine-tuning (improved from 56.2%)
  - Our test may be using the base/non-fine-tuned model
- Evaluation Protocol:
  - Different stopping criteria implementation
  - Different time step normalization
  - Possible differences in semantic exploration parameters
- Dataset Version:
  - Potential differences in HM-EQA dataset version or preprocessing
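To help rule out the first cause, the following is the kind of helper (hypothetical, not from the paper's codebase) we would use to record the exact checkpoint and run configuration we evaluated, so both sides can diff their setups:

```python
# Hypothetical reproducibility helper (not from the paper's codebase): record the
# exact checkpoint file and run configuration so setups can be compared directly.
import hashlib
import json
from pathlib import Path


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dump_run_manifest(checkpoint_path: str, config: dict, out_path: str = "run_manifest.json") -> None:
    """Write the checkpoint hash and config to a JSON manifest that can be attached to this issue."""
    manifest = {
        "checkpoint": checkpoint_path,
        "checkpoint_sha256": sha256_of(checkpoint_path),
        "config": config,
    }
    Path(out_path).write_text(json.dumps(manifest, indent=2))


# Example usage with our run's settings (paths are placeholders):
# dump_run_manifest(
#     "checkpoints/prism-dinosiglip+7b.pt",
#     {"model_id": "prism-dinosiglip+7b", "dataset": "HM-EQA", "num_scenarios": 500},
# )
```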
Questions for Reproduction
- Model Specification: What exact Prismatic VLM configuration was used in the paper experiments?
- Fine-tuning: Were the reported 58.4% results from fine-tuned or base models?
- Hyperparameters: What were the specific values for the following (see the config sketch after this list)?
  - Temperature scaling (τ_LSV, τ_GSV)
  - Semantic value weights
  - Stopping criteria thresholds
- Evaluation Setup:
  - Exact time step normalization method
  - Frontier sampling implementation details
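To make the hyperparameter question concrete, below is the kind of configuration block we are currently guessing at. Every field name and value here is a placeholder from our own runs, not a number or identifier taken from the paper; confirming or correcting each one is exactly what we are asking for:

```python
# Placeholder configuration used in our runs (all values are our guesses, NOT the
# paper's settings); we would appreciate the paper's actual values for each field.
from dataclasses import dataclass


@dataclass
class ExplorationConfig:
    tau_lsv: float = 1.0                 # temperature for local semantic value (τ_LSV)
    tau_gsv: float = 1.0                 # temperature for global semantic value (τ_GSV)
    semantic_value_weight: float = 0.5   # weight mixing semantic value into frontier scores
    confidence_threshold: float = 0.5    # stopping criterion on the VLM's answer confidence
    max_time_steps: int = 50             # budget behind the "max time steps" numbers
    num_frontier_samples: int = 8        # frontiers sampled per exploration step


config = ExplorationConfig()
```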
Request
Could you please provide:
- The exact model configuration and checkpoint used for the 58.4% baseline results
- Training/fine-tuning details and data splits
- Complete hyperparameter settings
- Any preprocessing steps or evaluation protocol details that might affect results
This would help ensure proper reproduction of the paper's results and identify the source of the performance discrepancy.
Additional Notes
The paper notes that performance scales with VLM capabilities, mentioning improvements with LLaVA 1.6 (65.3%) and GPT4-V (73.9%). However, even accounting for model differences, the gap between our base results and the paper's base results suggests systematic differences in the experimental setup.