fix(gemini): input token calculation when implicit cache is hit using langchain
## Context
For our Gemini usage (via Langchain through VertexAI), we learned that costs for cached tokens are not calculated correctly. We traced this back to cached tokens not being subtracted from the input token count: the input tokens were reported in `input_modality_1`, from which cached tokens were not subtracted at all.
## Observations (Current state)
- When `input_modality_1` contains tokens, the `input` token count is 0.
- The cached token logic only subtracts cached tokens from `input`, when they should be subtracted from `input_modality_1`.
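To make the failure mode concrete, here is a hypothetical parsed usage object of the shape described above (field names taken from this PR; token counts invented for illustration):

```python
# Hypothetical parsed usage before the fix (illustrative numbers only).
# Vertex AI reports prompt tokens per modality, so the generic "input" count
# is 0 and the prompt tokens land in "input_modality_1" instead.
usage_before_fix = {
    "input": 0,                   # cached tokens get subtracted here (a no-op)
    "input_modality_1": 100_000,  # still includes the 25_000 cached tokens
    "cached_modality_1": 25_000,  # billed separately at the discounted rate
    "output": 1_500,
}
# Cost is then computed from the full 100_000 input_modality_1 tokens at the
# regular input rate plus 25_000 tokens at the cached rate, double-billing
# the cached portion.
```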
Before fix:
## Impact
Price calculations are significantly off when caching is used via VertexAI (with Langchain). In the above example we're talking about a 23% deviation, but in cases where input tokens are the main cost and we make heavy use of caching, the calculation can be off by more than 50%.
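For illustration, the deviation as a function of the cache-hit share can be worked out as follows (hypothetical helper and prices; the sketch assumes cached tokens are billed at 25% of the regular input rate, so actual numbers depend on the model's pricing):

```python
def overcharge(total_prompt: int, cached: int, cache_discount: float = 0.25) -> float:
    """Relative overcharge caused by not subtracting cached tokens
    from the billed input count (illustrative model, hypothetical helper)."""
    correct = (total_prompt - cached) + cached * cache_discount
    buggy = total_prompt + cached * cache_discount  # cached tokens never subtracted
    return (buggy - correct) / correct

print(f"{overcharge(100_000, 20_000):.0%}")  # ~24% at a 20% cache-hit share
print(f"{overcharge(100_000, 50_000):.0%}")  # 80% with heavy caching
```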
## Proposed fix
Subtract `cache_tokens_details` from the corresponding `input_modality` in addition to subtracting from `input`.
Since this change is only applied to the specific `input_modality`, we do not expect any unexpected side effects.
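A minimal sketch of the intended logic (not the verbatim patch; `apply_cached_token_fix` is a hypothetical helper, and the field names follow the ones used in this PR):

```python
def apply_cached_token_fix(usage_model: dict, cache_tokens_details: list) -> dict:
    """Mirror the existing `input` subtraction for the modality-specific field."""
    for detail in cache_tokens_details:  # e.g. [{"modality": 1, "token_count": 25_000}]
        modality = detail["modality"]
        cached = detail["token_count"]
        usage_model[f"cached_modality_{modality}"] = cached

        # Existing behavior: subtract cached tokens from the generic input count.
        if "input" in usage_model:
            usage_model["input"] = max(0, usage_model["input"] - cached)

        # The fix: also subtract from input_modality_{modality}, where Vertex AI
        # actually reports the prompt tokens when modality details are present.
        key = f"input_modality_{modality}"
        if key in usage_model:
            usage_model[key] = max(0, usage_model[key] - cached)
    return usage_model
```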
After fix:
## Verification
I've validated this with a modified version of `langfuse.langchain.CallbackHandler.py` against our Langfuse Cloud app.
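For reference, the behavior can also be checked with the hypothetical helper sketched above and a payload of the same shape:

```python
usage = {"input": 0, "input_modality_1": 100_000, "output": 1_500}
cache_details = [{"modality": 1, "token_count": 25_000}]

fixed = apply_cached_token_fix(dict(usage), cache_details)
assert fixed["input_modality_1"] == 75_000  # cached tokens now subtracted
assert fixed["cached_modality_1"] == 25_000
```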
> [!IMPORTANT]
> Fixes token cost calculation by subtracting cached tokens from `input_modality_1` in `CallbackHandler.py`, correcting significant price deviations with VertexAI caching.
>
> - Behavior:
>   - Fixes token cost calculation by subtracting `cache_tokens_details` from `input_modality_1` in `_parse_usage_model()` in `CallbackHandler.py`.
>   - Ensures cached tokens are subtracted from both `input` and `input_modality_1`.
> - Impact:
>   - Corrects significant price calculation deviations (up to 50%) when using caching with VertexAI and Langchain.
> - Verification:
>   - Validated with a modified version of `langfuse.langchain.CallbackHandler.py` against the Langfuse Cloud app.
Disclaimer: Experimental PR review
## Greptile Overview

### Greptile Summary
Fixes a critical bug in Gemini/Vertex AI cached token calculation when using Langchain. When cached tokens are present and input tokens are reported in `input_modality_{modality}` fields (rather than the generic `input` field), the previous code only subtracted cached tokens from `input`, leaving `input_modality_{modality}` inflated. This caused cost calculations to be off by 23-50%+ when caching was used.
Key changes:
- Added logic to subtract cached tokens from the corresponding `input_modality_{modality}` field in addition to the `input` field
- Maintains consistency with how `prompt_tokens_details` and `candidates_tokens_details` are already handled
- Uses `max(0, ...)` to prevent negative token counts
Impact:
- Fixes significantly incorrect cost calculations for Vertex AI/Gemini usage with caching enabled
- No impact on non-cached requests or other providers
### Confidence Score: 5/5
- This PR is safe to merge with minimal risk
- The fix is surgical and well-targeted: it adds 2 lines that mirror the existing pattern used throughout the same function. The logic correctly subtracts cached tokens from the modality-specific input field using the same `max(0, ...)` safeguard pattern. The change only affects Vertex AI/Gemini scenarios where `cache_tokens_details` AND `input_modality` fields both exist, making it highly isolated with no risk to other providers or non-cached scenarios.
- No files require special attention
### Important Files Changed

#### File Analysis
| Filename | Score | Overview |
|---|---|---|
| langfuse/langchain/CallbackHandler.py | 5/5 | Fixed Gemini cached token calculation by subtracting cache tokens from input_modality in addition to input field |
### Sequence Diagram

```mermaid
sequenceDiagram
    participant LC as Langchain
    participant CB as CallbackHandler
    participant PU as _parse_usage_model
    participant LF as Langfuse

    LC->>CB: on_llm_end(response)
    CB->>PU: _parse_usage(response)
    Note over PU: Extract usage data from response

    alt Has cache_tokens_details (Vertex AI)
        PU->>PU: Extract cache token details
        PU->>PU: Create cached_modality_{modality} field
        alt input field exists
            PU->>PU: Subtract cached tokens from input
        end
        alt input_modality_{modality} exists
            PU->>PU: Subtract cached tokens from input_modality
            Note over PU: FIX: Ensures accurate token<br/>count when input is in modality
        end
    end

    PU-->>CB: Return usage_model with corrected tokens
    CB->>LF: Update generation with usage
    Note over LF: Cost calculated from<br/>corrected token counts
```