fix(gemini): input token calculation when implicit cache is hit using langchain
## Context
For our Gemini usage (via Langchain through VertexAI), we learned that costs for cached tokens are not calculated correctly. We traced this back to cached tokens not being subtracted from the input token count: the input tokens were reported in `input_modality_1`, from which cached tokens were not subtracted at all.
## Observations (Current state)
- When `input_modality_1` contains tokens, the `input` token count is 0.
- The cached token logic only subtracts cached tokens from `input`, when they should be subtracted from `input_modality_1`.
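To make the failure mode concrete, here is a hypothetical parsed usage object of the shape described above (field names taken from this PR; token counts invented for illustration):

```python
# Hypothetical parsed usage before the fix (illustrative numbers only).
# Vertex AI reports prompt tokens per modality, so the generic "input" count
# is 0 and the prompt tokens land in "input_modality_1" instead.
usage_before_fix = {
    "input": 0,                   # cached tokens get subtracted here (a no-op)
    "input_modality_1": 100_000,  # still includes the 25_000 cached tokens
    "cached_modality_1": 25_000,  # billed separately at the discounted rate
    "output": 1_500,
}
# Cost is then computed from the full 100_000 input_modality_1 tokens at the
# regular input rate plus 25_000 tokens at the cached rate, double-billing
# the cached portion.
```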
Before fix:
## Impact
Price calculations are significantly off when caching is used via VertexAI (with Langchain). In the above example we're talking about a 23% deviation, but in cases where input tokens are the main cost and we make heavy use of caching, the calculation can be off by more than 50%.
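For illustration, the deviation as a function of the cache-hit share can be worked out as follows (hypothetical helper and prices; the sketch assumes cached tokens are billed at 25% of the regular input rate, so actual numbers depend on the model's pricing):

```python
def overcharge(total_prompt: int, cached: int, cache_discount: float = 0.25) -> float:
    """Relative overcharge caused by not subtracting cached tokens
    from the billed input count (illustrative model, hypothetical helper)."""
    correct = (total_prompt - cached) + cached * cache_discount
    buggy = total_prompt + cached * cache_discount  # cached tokens never subtracted
    return (buggy - correct) / correct

print(f"{overcharge(100_000, 20_000):.0%}")  # ~24% at a 20% cache-hit share
print(f"{overcharge(100_000, 50_000):.0%}")  # 80% with heavy caching
```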
## Proposed fix
Subtract `cache_tokens_details` from the corresponding `input_modality` in addition to subtracting from `input`.
Since this change is only applied to the specific `input_modality`, we do not expect any unexpected side effects.
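A minimal sketch of the intended logic (not the verbatim patch; `apply_cached_token_fix` is a hypothetical helper, and the field names follow the ones used in this PR):

```python
def apply_cached_token_fix(usage_model: dict, cache_tokens_details: list) -> dict:
    """Mirror the existing `input` subtraction for the modality-specific field."""
    for detail in cache_tokens_details:  # e.g. [{"modality": 1, "token_count": 25_000}]
        modality = detail["modality"]
        cached = detail["token_count"]
        usage_model[f"cached_modality_{modality}"] = cached

        # Existing behavior: subtract cached tokens from the generic input count.
        if "input" in usage_model:
            usage_model["input"] = max(0, usage_model["input"] - cached)

        # The fix: also subtract from input_modality_{modality}, where Vertex AI
        # actually reports the prompt tokens when modality details are present.
        key = f"input_modality_{modality}"
        if key in usage_model:
            usage_model[key] = max(0, usage_model[key] - cached)
    return usage_model
```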
After fix:
## Verification
I've validated this with a modified version of `langfuse.langchain.CallbackHandler.py` against our Langfuse Cloud app.
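For reference, the behavior can also be checked with the hypothetical helper sketched above and a payload of the same shape:

```python
usage = {"input": 0, "input_modality_1": 100_000, "output": 1_500}
cache_details = [{"modality": 1, "token_count": 25_000}]

fixed = apply_cached_token_fix(dict(usage), cache_details)
assert fixed["input_modality_1"] == 75_000  # cached tokens now subtracted
assert fixed["cached_modality_1"] == 25_000
```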
> [!IMPORTANT]
> Fixes token cost calculation by subtracting cached tokens from `input_modality_1` in `CallbackHandler.py`, correcting significant price deviations with VertexAI caching.
>
> - Behavior:
>   - Fixes token cost calculation by subtracting `cache_tokens_details` from `input_modality_1` in `_parse_usage_model()` in `CallbackHandler.py`.
>   - Ensures cached tokens are subtracted from both `input` and `input_modality_1`.
> - Impact:
>   - Corrects significant price calculation deviations (up to 50%) when using caching with VertexAI and Langchain.
> - Verification:
>   - Validated with a modified version of `langfuse.langchain.CallbackHandler.py` against the Langfuse Cloud app.
Disclaimer: Experimental PR review
## Greptile Overview

### Greptile Summary
Fixes a critical bug in Gemini/Vertex AI cached token calculation when using Langchain. When cached tokens are present and input tokens are reported in `input_modality_{modality}` fields (rather than the generic `input` field), the previous code only subtracted cached tokens from `input`, leaving `input_modality_{modality}` inflated. This caused cost calculations to be off by 23-50%+ when caching was used.
Key changes:
- Added logic to subtract cached tokens from the corresponding `input_modality_{modality}` field in addition to the `input` field
- Maintains consistency with how `prompt_tokens_details` and `candidates_tokens_details` are already handled
- Uses `max(0, ...)` to prevent negative token counts
Impact:
- Fixes significantly incorrect cost calculations for Vertex AI/Gemini usage with caching enabled
- No impact on non-cached requests or other providers
### Confidence Score: 5/5
- This PR is safe to merge with minimal risk
- The fix is surgical and well-targeted: it adds 2 lines that mirror the existing pattern used throughout the same function. The logic correctly subtracts cached tokens from the modality-specific input field using the same `max(0, ...)` safeguard pattern. The change only affects Vertex AI/Gemini scenarios where `cache_tokens_details` AND `input_modality` fields both exist, making it highly isolated with no risk to other providers or non-cached scenarios.
- No files require special attention
### Important Files Changed

#### File Analysis
| Filename | Score | Overview |
|---|---|---|
| langfuse/langchain/CallbackHandler.py | 5/5 | Fixed Gemini cached token calculation by subtracting cache tokens from input_modality in addition to input field |
### Sequence Diagram

```mermaid
sequenceDiagram
    participant LC as Langchain
    participant CB as CallbackHandler
    participant PU as _parse_usage_model
    participant LF as Langfuse

    LC->>CB: on_llm_end(response)
    CB->>PU: _parse_usage(response)
    Note over PU: Extract usage data from response

    alt Has cache_tokens_details (Vertex AI)
        PU->>PU: Extract cache token details
        PU->>PU: Create cached_modality_{modality} field
        alt input field exists
            PU->>PU: Subtract cached tokens from input
        end
        alt input_modality_{modality} exists
            PU->>PU: Subtract cached tokens from input_modality
            Note over PU: FIX: Ensures accurate token<br/>count when input is in modality
        end
    end

    PU-->>CB: Return usage_model with corrected tokens
    CB->>LF: Update generation with usage
    Note over LF: Cost calculated from<br/>corrected token counts
```