
Gemini 2.5 Flash Model Inconsistency Between API and Online Interface (Spatial Understanding)

Open cantonioupao opened this issue 4 months ago • 4 comments

Description of the bug:

Gemini 2.5 Flash Model Inconsistency Between API and Online Interface

Issue Summary

I'm seeing significant inconsistencies in Gemini 2.5 Flash responses to identical inputs across interfaces (the API vs. the online "Spatial Understanding" demo) and implementations (React vs. Python), despite using the exact same model, configuration, and prompts.

Environment Details

  • Model: gemini-2.5-flash (via API) vs online Gemini interface
  • Task: Image segmentation with spatial understanding
  • Languages: JavaScript (React) and Python implementations
  • Libraries:
    • React: @google/genai JavaScript SDK
    • Python: google-genai Python SDK

Reproduction Steps

1. React Implementation (Google's Official Example)

// Using Google's official spatial understanding example
import {GoogleGenAI} from '@google/genai';

const ai = new GoogleGenAI({apiKey: process.env.GEMINI_API_KEY});

const response = await ai.models.generateContent({
  model: 'models/gemini-2.5-flash',
  contents: [
    {
      role: 'user',
      parts: [
        {
          inlineData: {
            data: imageBase64, // Same image across all tests
            mimeType: 'image/png',
          },
        },
        {
          text: "Give the segmentation masks for the very specific damaged/destroyed/eroded parts of the object. Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key \"box_2d\", the segmentation mask in key \"mask\", and the text label in the key \"label\". Use descriptive labels."
        },
      ],
    },
  ],
  config: {
    temperature: 0.5,
    thinkingConfig: {thinkingBudget: 0}
  },
});

2. Python Implementation

from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Give the segmentation masks for all objects. Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key \"box_2d\", the segmentation mask in key \"mask\", and the text label in the key \"label\". Use descriptive labels.",
        types.Part.from_bytes(data=image_bytes, mime_type="image/png")
    ],
    config=types.GenerateContentConfig(
        temperature=0.5
    )
)
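
For reference, a minimal sketch of how the requested JSON list can be parsed from the response text. This is illustrative only; it assumes the model wraps the list in a markdown ```json fence and that each entry carries the "box_2d", "mask", and "label" keys requested in the prompt. Adjust if your responses differ.

import json

# Illustrative helper: pull the JSON list out of the model's text response.
# Assumes the list may be wrapped in a ```json ... ``` fence.
def parse_segmentation_response(text):
    cleaned = text.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(cleaned)

for entry in parse_segmentation_response(response.text):
    # Each entry is expected to contain "box_2d", "mask", and "label".
    print(entry["label"], entry["box_2d"])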

3. Online Interface Test

  • Navigate to Google AI Studio or the Gemini online interface
  • Upload the same image
  • Use the exact same prompt text
  • Set temperature to 0.5
  • Select Gemini 2.5 Flash model

Observed Inconsistencies

  1. Different object detection: Some objects are detected in one interface but not in others (across repeated tests on the same image; this should not occur)
  2. Different labels: The same objects are labeled differently ("cracked lens" vs "destroyed camera") (this could be expected given the model's non-deterministic nature)
  3. Different bounding boxes: Varying coordinates for the same objects (minor deviations are expected due to non-determinism; a sketch for quantifying these differences follows this list)
  4. Different segmentation masks: Different mask boundaries for identical objects (minor deviations are expected due to non-determinism)
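
One way to quantify the label and bounding-box differences above is to match detections by label across two runs and compute bounding-box IoU. The sketch below is illustrative only; it assumes each entry uses the "box_2d": [y0, x0, y1, x1] format from the spatial understanding example and that labels match exactly across runs.

def iou(box_a, box_b):
    # Intersection-over-union for two [y0, x0, y1, x1] boxes.
    y0, x0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    y1, x1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, y1 - y0) * max(0, x1 - x0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def compare_runs(run_a, run_b):
    # run_a / run_b: parsed JSON lists from two platforms (e.g. API vs AI Studio).
    labels_a = {e["label"] for e in run_a}
    labels_b = {e["label"] for e in run_b}
    print("only in A:", labels_a - labels_b)
    print("only in B:", labels_b - labels_a)
    for label in labels_a & labels_b:
        box_a = next(e["box_2d"] for e in run_a if e["label"] == label)
        box_b = next(e["box_2d"] for e in run_b if e["label"] == label)
        print(f"{label}: IoU = {iou(box_a, box_b):.2f}")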

Potential Causes

  1. Model Version Differences: The API might be serving a different version/checkpoint of Gemini 2.5 Flash than the online interface
  2. Pre/Post-processing Differences: Different image preprocessing or response post-processing between interfaces
  3. Infrastructure Differences: Different serving infrastructure with different model weights
  4. Hidden Parameters: Undocumented parameters that differ between interfaces

Additional Context

  • This issue was discovered while implementing Google's official spatial understanding example
  • The problem persists even with temperature: 0, which should minimize randomness (a minimal repeatability-test sketch follows this list)
  • Same API key used across all tests
  • Images are processed identically (same base64 encoding, same dimensions)
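
A minimal repeatability test can look like the sketch below. Assumptions: the Python SDK shown earlier, a best-effort seed parameter in GenerateContentConfig, and a placeholder file name; none of this guarantees determinism.

from google import genai
from google.genai import types

client = genai.Client()

def run_once(image_bytes, prompt):
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[prompt, types.Part.from_bytes(data=image_bytes, mime_type="image/png")],
        config=types.GenerateContentConfig(
            temperature=0,  # minimize sampling randomness
            seed=42,        # best-effort; determinism is not guaranteed
            thinking_config=types.ThinkingConfig(thinking_budget=0),
        ),
    )
    return response.text

with open("object.png", "rb") as f:  # placeholder file name; use the same image as above
    image_bytes = f.read()
prompt = "..."  # use the exact same prompt text as in the implementations above

a = run_once(image_bytes, prompt)
b = run_once(image_bytes, prompt)
print("identical across runs:", a == b)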

Expected: Consistent model behavior across all interfaces when using identical inputs
Actual: Significant variations in responses despite identical configuration


Request

  1. Clarify if the API and online interface use identical model versions
  2. Investigate why the same model produces different results across interfaces
  3. Document any differences in preprocessing, postprocessing, or default parameters
  4. Provide guidance on achieving consistent results across platforms
  5. Consider adding model version/checkpoint identifiers to API responses for transparency

Actual vs expected behavior:

Expected Behavior

When using the same:

  • Model (gemini-2.5-flash)
  • System prompt (identical text)
  • Image (same file)
  • Temperature (same value)
  • Configuration (thinkingBudget: 0)

The responses should be consistent or at least very similar across all interfaces and implementations.

Actual Behavior

Inconsistent responses are observed across:

  1. React API implementation → Response A
  2. Python API implementation → Response B (different from A)
  3. Online Gemini interface → Response C (different from A and B)

All using identical inputs but producing different segmentation results.

cantonioupao avatar Aug 20 '25 10:08 cantonioupao

Thanks for the detailed report. Ideally, the same model is used across both the API and AI Studio. However, LLMs are by design generally non-deterministic, so some variation in outputs can occur even with the same input.

For more deterministic and consistent output, try providing more detailed and structured system instructions (e.g., well-defined class labels).
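
For example, a fixed label vocabulary can be supplied as a system instruction with the Python SDK. The sketch below is illustrative only; the class names and file name are placeholders, not a recommended taxonomy.

from google import genai
from google.genai import types

client = genai.Client()

# Placeholder vocabulary; constrain the model to these labels only.
ALLOWED_LABELS = ["cracked lens", "dented housing", "eroded surface", "missing part"]

system_instruction = (
    "You segment damaged regions of an object. "
    f"Only use labels from this list: {', '.join(ALLOWED_LABELS)}. "
    'Return a JSON list where each entry has "box_2d", "mask", and "label".'
)

with open("object.png", "rb") as f:  # placeholder file name
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Segment the damaged parts of the object in this image.",
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
    ],
    config=types.GenerateContentConfig(
        system_instruction=system_instruction,
        temperature=0,
    ),
)
print(response.text)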

Gunand3043 avatar Aug 26 '25 05:08 Gunand3043

@Gunand3043 Thank you for the response, but I believe my original issue may not have been fully understood.

I did acknowledge that LLMs are non-deterministic and mentioned that minor variations are expected. However, the inconsistencies I'm observing go beyond normal variance - I'm seeing completely different objects being detected across platforms, not just different labels.

Even with temperature: 0 (which should minimize randomness), the structural differences persist across your platforms.

Could you help clarify: Are the API and online interface using identical model weights/checkpoints for gemini-2.5-flash?

My concern isn't about output quality or instruction detail, but rather about platform consistency. When identical configurations produce significantly different object detection results across Google's own interfaces, this suggests a potential infrastructure discrepancy that would be valuable to understand.

I'd appreciate any technical insight you can provide on this.
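
In the meantime, one thing that may help narrow this down (an assumption on my part about the response schema): recent google-genai responses appear to expose the concrete model version that served the request, which could be logged and compared against what AI Studio reports.

from google import genai

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=["Reply with OK."],
)

# model_version (modelVersion in REST) reports which version served the request,
# assuming the field is populated for this model.
print(response.model_version)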

cantonioupao avatar Aug 28 '25 13:08 cantonioupao

Experiencing the same thing. Would appreciate any insight into this and how to overcome it.

chinchang avatar Nov 07 '25 10:11 chinchang

Experiencing the same thing too.

sodarfish avatar Nov 18 '25 06:11 sodarfish