
Gemini 2.5 Flash Model Inconsistency Between API and Online Interface (Spatial Understanding)

Open cantonioupao opened this issue 4 months ago • 4 comments

Description of the bug:

Gemini 2.5 Flash Model Inconsistency Between API and Online Interface

Issue Summary

I'm seeing significant inconsistencies in Gemini 2.5 Flash responses to identical inputs across interfaces (the API vs. the online "Spatial Understanding" demo) and implementations (React vs. Python), despite using the exact same model, configuration, and prompts.

Environment Details

  • Model: gemini-2.5-flash (via API) vs online Gemini interface
  • Task: Image segmentation with spatial understanding
  • Languages: JavaScript (React) and Python implementations
  • Libraries:
    • React: @google/genai JavaScript SDK
    • Python: google-genai Python SDK

Reproduction Steps

1. React Implementation (Google's Official Example)

// Using Google's official spatial understanding example
import {GoogleGenAI} from '@google/genai';

const ai = new GoogleGenAI({apiKey: process.env.GEMINI_API_KEY});

const response = await ai.models.generateContent({
  model: 'models/gemini-2.5-flash',
  contents: [
    {
      role: 'user',
      parts: [
        {
          inlineData: {
            data: imageBase64, // Same image across all tests
            mimeType: 'image/png',
          },
        },
        {
          text: "Give the segmentation masks for the very specific damaged/destroyed/eroded parts of the object. Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key \"box_2d\", the segmentation mask in key \"mask\", and the text label in the key \"label\". Use descriptive labels."
        },
      ],
    },
  ],
  config: {
    temperature: 0.5,
    thinkingConfig: {thinkingBudget: 0}
  },
});

2. Python Implementation

from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Give the segmentation masks for all objects. Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key \"box_2d\", the segmentation mask in key \"mask\", and the text label in the key \"label\". Use descriptive labels.",
        types.Part.from_bytes(data=image_bytes, mime_type="image/png")
    ],
    config=types.GenerateContentConfig(
        temperature=0.5
    )
)
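
For reference, a minimal sketch of how the requested JSON list can be parsed from the response text. This is illustrative only; it assumes the model wraps the list in a markdown ```json fence and that each entry carries the "box_2d", "mask", and "label" keys requested in the prompt. Adjust if your responses differ.

import json

# Illustrative helper: pull the JSON list out of the model's text response.
# Assumes the list may be wrapped in a ```json ... ``` fence.
def parse_segmentation_response(text):
    cleaned = text.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(cleaned)

for entry in parse_segmentation_response(response.text):
    # Each entry is expected to contain "box_2d", "mask", and "label".
    print(entry["label"], entry["box_2d"])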

3. Online Interface Test

  • Navigate to Google AI Studio or the Gemini online interface
  • Upload the same image
  • Use the exact same prompt text
  • Set temperature to 0.5
  • Select Gemini 2.5 Flash model

Observed Inconsistencies

  1. Different object detection: Some objects are detected in one interface but not in others (across repeated tests on the same image; this should not occur)
  2. Different labels: The same objects are labeled differently ("cracked lens" vs "destroyed camera") (this could be expected given the model's non-deterministic nature)
  3. Different bounding boxes: Varying coordinates for the same objects (minor deviations are expected due to non-determinism; a sketch for quantifying these differences follows this list)
  4. Different segmentation masks: Different mask boundaries for identical objects (minor deviations are expected due to non-determinism)
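
One way to quantify the label and bounding-box differences above is to match detections by label across two runs and compute bounding-box IoU. The sketch below is illustrative only; it assumes each entry uses the "box_2d": [y0, x0, y1, x1] format from the spatial understanding example and that labels match exactly across runs.

def iou(box_a, box_b):
    # Intersection-over-union for two [y0, x0, y1, x1] boxes.
    y0, x0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    y1, x1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, y1 - y0) * max(0, x1 - x0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def compare_runs(run_a, run_b):
    # run_a / run_b: parsed JSON lists from two platforms (e.g. API vs AI Studio).
    labels_a = {e["label"] for e in run_a}
    labels_b = {e["label"] for e in run_b}
    print("only in A:", labels_a - labels_b)
    print("only in B:", labels_b - labels_a)
    for label in labels_a & labels_b:
        box_a = next(e["box_2d"] for e in run_a if e["label"] == label)
        box_b = next(e["box_2d"] for e in run_b if e["label"] == label)
        print(f"{label}: IoU = {iou(box_a, box_b):.2f}")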

Potential Causes

  1. Model Version Differences: The API might be serving a different version/checkpoint of Gemini 2.5 Flash than the online interface
  2. Pre/Post-processing Differences: Different image preprocessing or response post-processing between interfaces
  3. Infrastructure Differences: Different serving infrastructure with different model weights
  4. Hidden Parameters: Undocumented parameters that differ between interfaces

Additional Context

  • This issue was discovered while implementing Google's official spatial understanding example
  • The problem persists even with temperature: 0, which should minimize randomness (a minimal repeatability-test sketch follows this list)
  • Same API key used across all tests
  • Images are processed identically (same base64 encoding, same dimensions)
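
A minimal repeatability test can look like the sketch below. Assumptions: the Python SDK shown earlier, a best-effort seed parameter in GenerateContentConfig, and a placeholder file name; none of this guarantees determinism.

from google import genai
from google.genai import types

client = genai.Client()

def run_once(image_bytes, prompt):
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[prompt, types.Part.from_bytes(data=image_bytes, mime_type="image/png")],
        config=types.GenerateContentConfig(
            temperature=0,  # minimize sampling randomness
            seed=42,        # best-effort; determinism is not guaranteed
            thinking_config=types.ThinkingConfig(thinking_budget=0),
        ),
    )
    return response.text

with open("object.png", "rb") as f:  # placeholder file name; use the same image as above
    image_bytes = f.read()
prompt = "..."  # use the exact same prompt text as in the implementations above

a = run_once(image_bytes, prompt)
b = run_once(image_bytes, prompt)
print("identical across runs:", a == b)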

Expected: Consistent model behavior across all interfaces when using identical inputs
Actual: Significant variations in responses despite identical configuration


Request

  1. Clarify if the API and online interface use identical model versions
  2. Investigate why the same model produces different results across interfaces
  3. Document any differences in preprocessing, postprocessing, or default parameters
  4. Provide guidance on achieving consistent results across platforms
  5. Consider adding model version/checkpoint identifiers to API responses for transparency

Actual vs expected behavior:

Expected Behavior

When using the same:

  • Model (gemini-2.5-flash)
  • System prompt (identical text)
  • Image (same file)
  • Temperature (same value)
  • Configuration (thinkingBudget: 0)

The responses should be consistent or at least very similar across all interfaces and implementations.

Actual Behavior

Inconsistent responses are observed across:

  1. React API implementation → Response A
  2. Python API implementation → Response B (different from A)
  3. Online Gemini interface → Response C (different from A and B)

All using identical inputs but producing different segmentation results.

cantonioupao avatar Aug 20 '25 10:08 cantonioupao

Thanks for the detailed report. Ideally, the same model is used across both the API and AI Studio. However, LLMs are by design generally non-deterministic, so some variation in outputs can occur even with the same input.

For more deterministic and consistent output, try providing more detailed and structured system instructions (e.g., well-defined class labels).
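
For example, a fixed label vocabulary can be supplied as a system instruction with the Python SDK. The sketch below is illustrative only; the class names and file name are placeholders, not a recommended taxonomy.

from google import genai
from google.genai import types

client = genai.Client()

# Placeholder vocabulary; constrain the model to these labels only.
ALLOWED_LABELS = ["cracked lens", "dented housing", "eroded surface", "missing part"]

system_instruction = (
    "You segment damaged regions of an object. "
    f"Only use labels from this list: {', '.join(ALLOWED_LABELS)}. "
    'Return a JSON list where each entry has "box_2d", "mask", and "label".'
)

with open("object.png", "rb") as f:  # placeholder file name
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Segment the damaged parts of the object in this image.",
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
    ],
    config=types.GenerateContentConfig(
        system_instruction=system_instruction,
        temperature=0,
    ),
)
print(response.text)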

Gunand3043 avatar Aug 26 '25 05:08 Gunand3043

@Gunand3043 Thank you for the response, but I believe my original issue may not have been fully understood.

I did acknowledge that LLMs are non-deterministic and mentioned that minor variations are expected. However, the inconsistencies I'm observing go beyond normal variance - I'm seeing completely different objects being detected across platforms, not just different labels.

Even with temperature: 0 (which should minimize randomness), the structural differences persist across your platforms.

Could you help clarify: Are the API and online interface using identical model weights/checkpoints for gemini-2.5-flash?

My concern isn't about output quality or instruction detail, but rather about platform consistency. When identical configurations produce significantly different object detection results across Google's own interfaces, this suggests a potential infrastructure discrepancy that would be valuable to understand.

I'd appreciate any technical insight you can provide on this.
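
In the meantime, one thing that may help narrow this down (an assumption on my part about the response schema): recent google-genai responses appear to expose the concrete model version that served the request, which could be logged and compared against what AI Studio reports.

from google import genai

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=["Reply with OK."],
)

# model_version (modelVersion in REST) reports which version served the request,
# assuming the field is populated for this model.
print(response.model_version)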

cantonioupao avatar Aug 28 '25 13:08 cantonioupao

Experiencing the same thing. Would appreciate any insight into this and how to overcome it.

chinchang avatar Nov 07 '25 10:11 chinchang

Experiencing the same thing too.

sodarfish avatar Nov 18 '25 06:11 sodarfish