Gemini 2.5 Flash Model Inconsistency Between API and Online Interface (Spatial Understanding)
Description of the bug:
Gemini 2.5 Flash Model Inconsistency Between API and Online Interface
Issue Summary
I'm experiencing significant inconsistencies in responses from Gemini 2.5 Flash when using identical inputs across different interfaces (API vs online interface called "Spatial Understanding") and implementations (React vs Python), despite using the exact same model, configuration, and prompts.
Environment Details
- Model: gemini-2.5-flash (via API) vs online Gemini interface
- Task: Image segmentation with spatial understanding
- Languages: JavaScript (React) and Python implementations
- Libraries:
  - React: @google/genai JavaScript SDK
  - Python: google-genai Python SDK
Reproduction Steps
1. React Implementation (Google's Official Example)
// Using Google's official spatial understanding example
import {GoogleGenAI} from '@google/genai';

const ai = new GoogleGenAI({apiKey: process.env.GEMINI_API_KEY});
const response = await ai.models.generateContent({
  model: 'models/gemini-2.5-flash',
  contents: [
    {
      role: 'user',
      parts: [
        {
          inlineData: {
            data: imageBase64, // Same image across all tests
            mimeType: 'image/png',
          },
        },
        {
          text: "Give the segmentation masks for the very specific damaged/destroyed/eroded parts of the object. Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key \"box_2d\", the segmentation mask in key \"mask\", and the text label in the key \"label\". Use descriptive labels."
        },
      ],
    },
  ],
  config: {
    temperature: 0.5,
    thinkingConfig: {thinkingBudget: 0}
  },
});
2. Python Implementation
from google import genai
from google.genai import types

client = genai.Client()

# Load the same image used in the React test (path is illustrative)
with open("image.png", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Give the segmentation masks for all objects. Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key \"box_2d\", the segmentation mask in key \"mask\", and the text label in the key \"label\". Use descriptive labels.",
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
    ],
    config=types.GenerateContentConfig(
        temperature=0.5
    ),
)
3. Online Interface Test
- Navigate to Google AI Studio or Gemini online interface
- Upload the same image
- Use the exact same prompt text
- Set temperature to 0.5
- Select Gemini 2.5 Flash model
Observed Inconsistencies
- Different object detection: Some objects detected in one interface but not others (this should not occur across repeated tests on the same image; see the comparison sketch below)
- Different labels: Same objects labeled differently ("cracked lens" vs "destroyed camera") (this could be expected given the non-deterministic nature)
- Different bounding boxes: Varying coordinates for the same objects (minor deviations could be expected, due to the non-deterministic nature)
- Different segmentation masks: Different mask boundaries for identical objects (minor deviations could be expected, due to the non-deterministic nature)
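To make "detected in one interface but not another" measurable, the two JSON responses can be matched on bounding-box overlap. This is a minimal sketch under my own assumptions: the helper names and the 0.5 IoU threshold are illustrative, and box_2d is assumed to follow the [y0, x0, y1, x1] normalized convention used in the spatial understanding examples.

```python
import json

def iou(a, b):
    # Intersection-over-union of two [y0, x0, y1, x1] boxes
    y0, x0 = max(a[0], b[0]), max(a[1], b[1])
    y1, x1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, y1 - y0) * max(0, x1 - x0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def unmatched_labels(response_a_text, response_b_text, iou_threshold=0.5):
    # Labels from response A whose boxes have no counterpart in response B
    dets_a = json.loads(response_a_text)
    dets_b = json.loads(response_b_text)
    missing = []
    for det in dets_a:
        best = max((iou(det["box_2d"], other["box_2d"]) for other in dets_b), default=0.0)
        if best < iou_threshold:
            missing.append(det["label"])
    return missing
```

Anything this reports as unmatched is the kind of structural difference I mean, as opposed to the minor coordinate jitter noted above.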
Potential Causes
- Model Version Differences: The API might be serving a different version/checkpoint of Gemini 2.5 Flash than the online interface
- Pre/Post-processing Differences: Different image preprocessing or response post-processing between interfaces
- Infrastructure Differences: Different serving infrastructure with different model weights
- Hidden Parameters: Undocumented parameters that differ between interfaces
Additional Context
- This issue was discovered while implementing Google's official spatial understanding example
- The problem persists even with temperature: 0 (should be deterministic)
- Same API key used across all tests
- Images are processed identically (same base64 encoding, same dimensions)
Expected: Consistent model behavior across all interfaces when using identical inputs
Actual: Significant variations in responses despite identical configuration
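A useful control for the temperature: 0 point is to call the same API twice in a row and diff the raw output, which separates within-platform variance from cross-platform variance. Here is a minimal sketch, assuming the same client setup as above; whether the seed field of GenerateContentConfig guarantees repeatable output is not something I can confirm, so treat it as optional.

```python
from google import genai
from google.genai import types

client = genai.Client()

with open("image.png", "rb") as f:  # path is illustrative
    image_bytes = f.read()

PROMPT = (
    "Give the segmentation masks for all objects. Output a JSON list of segmentation masks "
    "where each entry contains the 2D bounding box in the key \"box_2d\", the segmentation mask "
    "in key \"mask\", and the text label in the key \"label\". Use descriptive labels."
)

def run_once():
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[PROMPT, types.Part.from_bytes(data=image_bytes, mime_type="image/png")],
        config=types.GenerateContentConfig(
            temperature=0,
            seed=42,  # assumption: fixing a seed may help repeatability
            thinking_config=types.ThinkingConfig(thinking_budget=0),
        ),
    )
    return response.text

a, b = run_once(), run_once()
print("identical across two API calls:", a == b)
```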
Request
- Clarify if the API and online interface use identical model versions
- Investigate why the same model produces different results across interfaces
- Document any differences in preprocessing, postprocessing, or default parameters
- Provide guidance on achieving consistent results across platforms
- Consider adding model version/checkpoint identifiers to API responses for transparency
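On the last point: if the generate_content response populates its model_version field (mirroring the REST modelVersion), logging it on every run would be one way to compare which checkpoint each platform serves. A minimal sketch under that assumption:

```python
from google import genai

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=["Reply with OK."],
)
# model_version mirrors the REST modelVersion field when the server returns it
print("served model version:", getattr(response, "model_version", "not reported"))
```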
Actual vs expected behavior:
Expected Behavior
When using the same:
- Model (gemini-2.5-flash)
- System prompt (identical text)
- Image (same file)
- Temperature (same value)
- Configuration (thinkingBudget: 0)
The responses should be consistent or at least very similar across all interfaces and implementations.
Actual Behavior
Inconsistent responses are observed across:
- ✅ React API implementation → Response A
- ✅ Python API implementation → Response B (different from A)
- ✅ Online Gemini interface → Response C (different from A and B)
All using identical inputs but producing different segmentation results.
Thanks for the detailed report. Ideally, the same model is used across both the API and AI Studio. But, by design, LLMs are generally non-deterministic, so some variation in outputs can occur, even with the same input.
For more deterministic and consistent output, try providing more detailed and structured system instructions (e.g., well-defined class labels).
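For illustration only, a constrained label vocabulary and a JSON response type could be passed like this (the labels, system text, and use of response_mime_type are hypothetical additions, not part of the original report):

```python
from google import genai
from google.genai import types

client = genai.Client()

# Illustrative fixed vocabulary; adjust to the damage categories of interest
ALLOWED_LABELS = ["cracked lens", "dented housing", "eroded coating", "missing part"]

system_instruction = (
    "You segment damaged regions of objects. "
    "Only use labels from this list: " + ", ".join(ALLOWED_LABELS) + ". "
    "Return a JSON list where each entry has keys box_2d, mask, and label."
)

with open("image.png", "rb") as f:  # path is illustrative
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Give the segmentation masks for the damaged parts of the object.",
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
    ],
    config=types.GenerateContentConfig(
        system_instruction=system_instruction,
        temperature=0,
        response_mime_type="application/json",
    ),
)
print(response.text)
```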
@Gunand3043 Thank you for the response, but I believe my original issue may not have been fully understood.
I did acknowledge that LLMs are non-deterministic and mentioned that minor variations are expected. However, the inconsistencies I'm observing go beyond normal variance - I'm seeing completely different objects being detected across platforms, not just different labels.
Even with temperature: 0 (which should minimize randomness), the structural differences persist across your platforms.
Could you help clarify: Are the API and online interface using identical model weights/checkpoints for gemini-2.5-flash?
My concern isn't about output quality or instruction detail, but rather about platform consistency. When identical configurations produce significantly different object detection results across Google's own interfaces, this suggests a potential infrastructure discrepancy that would be valuable to understand.
I'd appreciate any technical insight you can provide on this.
Experiencing same thing. Would appreciate any insight on this and how to overcome it.
Experiencing same thing too.