openllmetry
openllmetry copied to clipboard
fix(bedrock): Add prompt caching support for Converse API
- [x] I have added tests that cover my changes.
- [ ] If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
- [x] PR name follows conventional commits format:
feat(instrumentation): ...orfix(instrumentation): .... - [ ] (If applicable) I have updated the documentation accordingly.
Description
This PR introduces prompt caching telemetry for the AWS Bedrock Converse and Converse Stream APIs, bringing feature parity with the existing invoke_model instrumentation.
The Converse API reports caching information in the usage field of the response body, rather than through HTTP headers. This implementation adds the necessary logic to parse this information and record it as metrics and span attributes.
Changes include:
- New function
prompt_caching_converse_handlinginprompt_caching.pyto extractcache_read_input_tokensandcache_creation_input_tokensfrom the response body. - Integration into
__init__.py: The new function is now called from_handle_converseand_handle_converse_streamto process caching data for both standard and streaming calls. - New Test File: Added
test_bedrock_converse_prompt_caching_metrics.pyto validate that thegen_ai.prompt.cachingmetric is correctly emitted for the Converse API.
Fixes #3337
[!IMPORTANT] Adds prompt caching telemetry for AWS Bedrock Converse APIs, including new function for caching data extraction and corresponding tests.
- Behavior:
- Adds
prompt_caching_converse_handlinginprompt_caching.pyto extract caching data from Converse API response body.- Integrates
prompt_caching_converse_handlinginto_handle_converseand_handle_converse_streamin__init__.py.- Testing:
- Adds
test_bedrock_converse_prompt_caching_metrics.pyto validategen_ai.prompt.cachingmetric emission for Converse API.- Misc:
- Fixes #3337.
This description was created by
for 4fa37924888b5501ff40e9becdb27551deb01de6. You can customize this summary. It will automatically update as commits are pushed.
Summary by CodeRabbit
-
New Features
- Prompt-caching detection for Converse (streaming and non‑streaming), recording cache read/write state and token-level usage into telemetry and metrics.
-
Tests
- Added tests validating cache write and cache read scenarios, span attributes, and aggregated token metrics.
- Added a test cassette exercising a Converse error response path.
Hi @nirga
I’ve resolved the lint test failures. The remaining failing test, test_prompt_cache_converse, is expected since it requires a VCR cassette to be recorded.
As I don’t have access to an active AWS account, I’m unable to generate the test_prompt_cache_converse.yaml cassette file myself. Would you be able to check out this branch, run the test and push the generated cassette file to this PR?
Thanks for your help!
Thanks for the great suggestion and for your willingness to help record the test!
I agree that relying on an existing test is a cleaner approach. Before I push the changes, I just want to confirm my plan sounds good to you.
Here is what I am planning to do:
- Modify the Existing Test: I will update the
test_titan_conversefunction intests/traces/test_titan.py. - Enable Caching: I'll add the
additionalModelRequestFieldswithcacheControlto the existingbrt.converseAPI call. - Test Both Scenarios: I will add a second
brt.conversecall within that same test to ensure we cover both the initial "cache-write" and the subsequent "cache-read". - Add Assertions: I will add the metric assertions I wrote to validate that the prompt caching counters are working correctly.
- Clean Up: Finally, I will delete the new test file I originally created (
test_bedrock_converse_prompt_caching_metrics.py).
This will result in the cassette for test_titan_converse.yaml needing to be re-recorded, as you mentioned.
Does this plan look good? If so, I'll go ahead and make the changes.
@nirga any update on this?
[!NOTE]
Other AI code review bot(s) detected
CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
Walkthrough
Adds response/metadata-based prompt caching handling for Bedrock Converse and invokes it from both synchronous and streaming converse flows; introduces a new handler that reads cache read/write token counts from responses/metadata, sets span attributes, and emits cache metrics when enabled.
Changes
| Cohort / File(s) | Summary |
|---|---|
Instrumentation entrypoints packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py |
Imported prompt_caching_converse_handling (and prompt_caching_handling) and added calls to prompt_caching_converse_handling(...) in _handle_converse() and in the final-message path of _handle_converse_stream() after guardrail handling. |
Prompt caching logic packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py |
Added prompt_caching_converse_handling(response, vendor, model, metric_params) which extracts cache read/write token counts from a response or metadata object, verifies current span presence and recording, sets CACHED span attribute and gen_ai.usage.* token attributes, and emits cache read/write metrics when configured. |
Tests & cassettes packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py, packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py, packages/opentelemetry-instrumentation-bedrock/tests/traces/cassettes/test_titan/test_titan_converse_with_caching.yaml |
Added metric helper functions and test_anthropic_converse_with_caching to validate caching read/write behaviour (note: duplicate test definition present); minor formatting tweak in test_titan.py; added a Titan Converse cassette capturing a 403 auth-failure response for converse-with-caching scenario. |
Sequence Diagram(s)
sequenceDiagram
participant Client
participant BedrockInstrumentor
participant BedrockAPI
participant PromptCacheHandler
participant Span
participant Metrics
Client->>BedrockInstrumentor: invoke_converse(...)
BedrockInstrumentor->>BedrockAPI: call Converse (sync or stream)
BedrockAPI-->>BedrockInstrumentor: response / streamed events (include usage_metadata or metadata)
alt sync final response
BedrockInstrumentor->>PromptCacheHandler: prompt_caching_converse_handling(response, vendor, model, metric_params)
else streaming final message
BedrockInstrumentor->>PromptCacheHandler: prompt_caching_converse_handling(metadata, vendor, model, metric_params)
end
rect #EEF8FF
Note over PromptCacheHandler: extract cache_read/cache_creation\nset span attrs and gen_ai.usage.* token attrs
PromptCacheHandler->>Span: set CACHED ("read"/"write") and token attributes
end
alt cache_read > 0
PromptCacheHandler->>Metrics: emit cache read metric
end
alt cache_creation > 0
PromptCacheHandler->>Metrics: emit cache write metric
end
BedrockInstrumentor-->>Client: return response / stream end
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~20 minutes
- Verify response/metadata key paths (e.g.,
usage_metadata,input_token_details) match Bedrock Converse shapes. - Confirm span existence and is_recording checks are correct for streaming final-message handling.
- Ensure handler is only invoked for final events (not intermediate stream events).
- Review the duplicated test definition in
test_anthropic.pyand cassette expectations.
Suggested reviewers
- galkleinman
Poem
🐰 I hopped through responses, counted tokens by light,
read and wrote some markers, and set spans just right.
Metrics took note as the cache did its dance,
traces wore small badges from a fortuitous chance.
Hooray — Converse caching tracked in a joyful prance!
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings, 1 inconclusive)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Out of Scope Changes Check | ⚠️ Warning | The changes to test_anthropic.py appear to be out of scope for a PR focused on "Add prompt caching support for Converse API" (Bedrock); while they introduce caching test utilities and a new caching test for Anthropic models, they do not directly align with the Bedrock Converse objectives stated in issue #3337. More critically, the summary indicates that test_anthropic_converse_with_caching is defined twice (duplicate function definition), which is a clear code error that should not be present in a merged changeset regardless of scope considerations. |
Remove the duplicate test_anthropic_converse_with_caching function definition from test_anthropic.py. Additionally, clarify whether the Anthropic test changes are intentional and necessary for this PR; if they represent a separate feature effort, consider moving them to a separate pull request focused on Anthropic caching to keep this PR focused on its stated Bedrock Converse objectives. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
| Linked Issues Check | ❓ Inconclusive | The code implementation addresses the core requirements from issue #3337 by creating the prompt_caching_converse_handling function that extracts cache metrics from the Converse response body, integrating it into both _handle_converse and _handle_converse_stream, recording cache-related metrics and span attributes, and emitting telemetry to the gen_ai usage attributes. However, the linked requirement to add and validate tests covering both cache-write and cache-read flows is not fully satisfied; the test cassette for test_titan_converse_with_caching.yaml contains a 403 authentication error response rather than a successful caching scenario, and reviewers have reported botocore ValidationExceptions indicating the test input contains an extraneous cacheControl key that is not permitted by the API, preventing successful validation of the implementation. |
The implementation should be validated by running the tests with valid AWS credentials to properly record the VCR cassettes and ensure they capture successful caching scenarios. Additionally, the test input should be corrected to remove or properly format the cacheControl parameter to resolve the botocore ValidationException before the PR can be fully verified against the issue requirements. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The pull request title "fix(bedrock): Add prompt caching support for Converse API" accurately and concisely describes the main objective of the changeset. The primary changes across the codebase—adding the prompt_caching_converse_handling function, integrating it into _handle_converse and _handle_converse_stream, and introducing associated tests—all directly align with this stated goal. The title is specific and clear without unnecessary noise, making it immediately obvious to reviewers what aspect of the Bedrock instrumentation is being enhanced. |
✨ Finishing touches
- [ ] 📝 Generate docstrings
🧪 Generate unit tests (beta)
- [ ] Create PR with unit tests
- [ ] Post copyable unit tests in a comment
📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📥 Commits
Reviewing files that changed from the base of the PR and between 771b5d5971eaa1e84b528e927bc1695320f9416a and ea2b0dac8f13b8724e3cfcd2c5caedde3b2c3660.
📒 Files selected for processing (1)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code Use Flake8 for code linting and adhere to its rules
Files:
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py
🧬 Code graph analysis (1)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (5)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (1)
PromptCaching(470-472)packages/opentelemetry-instrumentation-milvus/tests/conftest.py (1)
reader(37-41)packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1)
CacheSpanAttrs(9-11)packages/opentelemetry-instrumentation-bedrock/tests/conftest.py (2)
instrument_legacy(90-102)brt(36-42)packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (1)
get_finished_spans(40-43)
🪛 Ruff (0.14.2)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py
22-22: Create your own exception
(TRY002)
22-22: Avoid specifying long messages outside the exception class
(TRY003)
1113-1113: Unused function argument: instrument_legacy
(ARG001)
🔇 Additional comments (2)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (2)
16-44: LGTM! Clear helper functions for metric validation.The
get_metricandassert_metrichelper functions are well-structured for validating prompt caching metrics in tests. The logic correctly distinguishes between cache read and write operations and verifies the token counts.
1112-1171: Test implementation looks correct; VCR cassette recording pending.The test logic is sound and all critical issues from previous reviews have been addressed:
- ✓ Uses
cachePointblocks in message content (correct API usage)- ✓ Uses
GenAIAttributes.GEN_AI_REQUEST_MODEL(correct attribute)- ✓ Proper indentation and formatting
- ✓ Correct cache point type
"default"The test validates both cache-write and cache-read flows with appropriate assertions on usage tokens, span attributes, and metrics. According to the PR objectives, the VCR cassette (
test_anthropic_converse_with_caching.yaml) still needs to be recorded by a reviewer with AWS credentials.Note: The static analysis warning about unused
instrument_legacyis a false positive—it's a pytest fixture that sets up instrumentation.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Comment @coderabbitai help to get the list of available commands and usage tips.
Hi @nirga , I've updated the PR with the changes we discussed. I moved the logic into a new test in test_titan.py and deleted the old file.
It should be ready for the cassette to be recorded now. Thanks again for your help!
Thanks @AlanPonnachan! Can you record the test? poetry run pytest --record-mode=once
Hi @nirga, thank you again for all your guidance.
I tried to run the recording command locally as you requested. As expected, since I don't have AWS credentials, the test run fails with an UnrecognizedClientException (invalid security token). This confirms that the test is now correctly set up and is just waiting for a real recording to be generated.
I believe the PR is now ready from a code perspective. Would you be able to run the recording on your end when you have a moment?
Thank you so much for your help
Thanks @AlanPonnachan, I tried running it locally and the test failed:
FAILED tests/traces/test_titan.py::test_titan_converse_with_caching - botocore.errorfactory.ValidationException: An error occurred (ValidationException) when calling the Converse operation: The model returned the following errors: Malformed input request: extraneous key [cacheControl] is not permitted, please reformat your input and try again.
Hi @nirga,
Thank you so much for running the test and providing that error!
I did some research and found that, in general, newer Amazon Titan models do support prompt caching. However, you are right that the specific model in the original test, amazon.titan-text-express-v1, does not support it, which perfectly explains the error.
To ensure the test is stable and reliable, I have implemented the fix we discussed:
- I reverted all changes to
test_titan.pyto ensure it remains focused on Guardrails. - I moved the caching test logic into a new function,
test_anthropic_converse_with_caching, within the correcttest_anthropic.pyfile. - The new test uses a supported Anthropic Claude 3 model and the correct API structure.
Would you be able to run the recording on your end when you have a moment?