
fix(bedrock): Add prompt caching support for Converse API

Open AlanPonnachan opened this issue 2 months ago • 10 comments

  • [x] I have added tests that cover my changes.
  • [ ] If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
  • [x] PR name follows conventional commits format: feat(instrumentation): ... or fix(instrumentation): ....
  • [ ] (If applicable) I have updated the documentation accordingly.

Description

This PR introduces prompt caching telemetry for the AWS Bedrock Converse and Converse Stream APIs, bringing feature parity with the existing invoke_model instrumentation.

The Converse API reports caching information in the usage field of the response body, rather than through HTTP headers. This implementation adds the necessary logic to parse this information and record it as metrics and span attributes.
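To make that difference concrete, here is a hedged illustration of the kind of usage block the new logic parses. The cache-related key names are assumptions about the Converse response shape and are not code taken from this PR:

```python
# Hedged illustration only: the cache-related key names below are assumptions
# about the Converse response body, not fields copied from this PR.
response = {
    "output": {"message": {"role": "assistant", "content": [{"text": "..."}]}},
    "stopReason": "end_turn",
    "usage": {
        "inputTokens": 12,
        "outputTokens": 87,
        "totalTokens": 99,
        "cacheReadInputTokens": 0,      # > 0 when the prompt prefix was served from cache
        "cacheWriteInputTokens": 1024,  # > 0 when a new cache entry was created
    },
}

# The instrumentation reads the cache counters from the usage block rather than
# from HTTP headers, which is the key difference from the invoke_model path.
cache_read = response["usage"].get("cacheReadInputTokens", 0)
cache_write = response["usage"].get("cacheWriteInputTokens", 0)
```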

Changes include:

  1. New function prompt_caching_converse_handling in prompt_caching.py to extract cache_read_input_tokens and cache_creation_input_tokens from the response body.
  2. Integration into __init__.py: The new function is now called from _handle_converse and _handle_converse_stream to process caching data for both standard and streaming calls.
  3. New Test File: Added test_bedrock_converse_prompt_caching_metrics.py to validate that the gen_ai.prompt.caching metric is correctly emitted for the Converse API.

Fixes #3337


[!IMPORTANT] Adds prompt caching telemetry for the AWS Bedrock Converse APIs, including a new function for extracting caching data and corresponding tests.

  • Behavior:
    • Adds prompt_caching_converse_handling in prompt_caching.py to extract caching data from Converse API response body.
    • Integrates prompt_caching_converse_handling into _handle_converse and _handle_converse_stream in __init__.py.
  • Testing:
    • Adds test_bedrock_converse_prompt_caching_metrics.py to validate gen_ai.prompt.caching metric emission for Converse API.
  • Misc:
    • Fixes #3337.

This description was created by Ellipsis for 4fa37924888b5501ff40e9becdb27551deb01de6. You can customize this summary. It will automatically update as commits are pushed.

Summary by CodeRabbit

  • New Features

    • Prompt-caching detection for Converse (streaming and non‑streaming), recording cache read/write state and token-level usage into telemetry and metrics.
  • Tests

    • Added tests validating cache write and cache read scenarios, span attributes, and aggregated token metrics.
    • Added a test cassette exercising a Converse error response path.

AlanPonnachan avatar Sep 19 '25 14:09 AlanPonnachan

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Sep 19 '25 14:09 CLAassistant

Hi @nirga

I’ve resolved the lint failures. The remaining test failure, test_prompt_cache_converse, is expected, since that test requires a VCR cassette that has not yet been recorded.

As I don’t have access to an active AWS account, I’m unable to generate the test_prompt_cache_converse.yaml cassette file myself. Would you be able to check out this branch, run the test and push the generated cassette file to this PR?

Thanks for your help!

AlanPonnachan avatar Sep 19 '25 21:09 AlanPonnachan

Thanks for the great suggestion and for your willingness to help record the test!

I agree that relying on an existing test is a cleaner approach. Before I push the changes, I just want to confirm my plan sounds good to you.

Here is what I am planning to do:

  1. Modify the Existing Test: I will update the test_titan_converse function in tests/traces/test_titan.py.
  2. Enable Caching: I'll add the additionalModelRequestFields with cacheControl to the existing brt.converse API call.
  3. Test Both Scenarios: I will add a second brt.converse call within that same test to ensure we cover both the initial "cache-write" and the subsequent "cache-read".
  4. Add Assertions: I will add the metric assertions I wrote to validate that the prompt caching counters are working correctly.
  5. Clean Up: Finally, I will delete the new test file I originally created (test_bedrock_converse_prompt_caching_metrics.py).

This will result in the cassette for test_titan_converse.yaml needing to be re-recorded, as you mentioned.

Does this plan look good? If so, I'll go ahead and make the changes.

AlanPonnachan avatar Sep 21 '25 06:09 AlanPonnachan

@nirga any update on this?

AlanPonnachan avatar Oct 28 '25 14:10 AlanPonnachan

[!NOTE]

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds response/metadata-based prompt caching handling for Bedrock Converse and invokes it from both synchronous and streaming converse flows; introduces a new handler that reads cache read/write token counts from responses/metadata, sets span attributes, and emits cache metrics when enabled.

Changes

  • Instrumentation entrypoints: packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py
    Imports prompt_caching_converse_handling (and prompt_caching_handling) and adds calls to prompt_caching_converse_handling(...) in _handle_converse() and in the final-message path of _handle_converse_stream(), after guardrail handling.
  • Prompt caching logic: packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py
    Adds prompt_caching_converse_handling(response, vendor, model, metric_params), which extracts cache read/write token counts from a response or metadata object, verifies that the current span is present and recording, sets the CACHED span attribute and gen_ai.usage.* token attributes, and emits cache read/write metrics when configured (a minimal sketch follows this list).
  • Tests & cassettes: packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py, packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py, packages/opentelemetry-instrumentation-bedrock/tests/traces/cassettes/test_titan/test_titan_converse_with_caching.yaml
    Adds metric helper functions and test_anthropic_converse_with_caching to validate caching read/write behaviour (note: a duplicate test definition is present), makes a minor formatting tweak in test_titan.py, and adds a Titan Converse cassette capturing a 403 auth-failure response for the converse-with-caching scenario.
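For orientation, a minimal sketch of what a handler of this shape might look like. It is illustrative only and not the PR's actual implementation; the usage key names, span attribute names, and the metric handle on metric_params are assumptions:

```python
# Sketch only: usage keys, attribute names, and the metric_params shape are assumptions.
from opentelemetry import trace


def prompt_caching_converse_handling(response, vendor, model, metric_params):
    usage = (response or {}).get("usage", {})
    cache_read = usage.get("cacheReadInputTokens", 0) or 0
    cache_write = usage.get("cacheWriteInputTokens", 0) or 0

    span = trace.get_current_span()
    if not span.is_recording() or (cache_read == 0 and cache_write == 0):
        return

    base_attrs = {"gen_ai.system": vendor, "gen_ai.response.model": model}

    if cache_read > 0:
        span.set_attribute("gen_ai.usage.cache_read_input_tokens", cache_read)
        span.set_attribute("gen_ai.prompt.caching", "read")
        if getattr(metric_params, "prompt_caching", None):
            metric_params.prompt_caching.add(cache_read, {**base_attrs, "gen_ai.cache.state": "read"})

    if cache_write > 0:
        span.set_attribute("gen_ai.usage.cache_creation_input_tokens", cache_write)
        span.set_attribute("gen_ai.prompt.caching", "write")
        if getattr(metric_params, "prompt_caching", None):
            metric_params.prompt_caching.add(cache_write, {**base_attrs, "gen_ai.cache.state": "write"})
```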

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant BedrockInstrumentor
    participant BedrockAPI
    participant PromptCacheHandler
    participant Span
    participant Metrics

    Client->>BedrockInstrumentor: invoke_converse(...)
    BedrockInstrumentor->>BedrockAPI: call Converse (sync or stream)
    BedrockAPI-->>BedrockInstrumentor: response / streamed events (include usage_metadata or metadata)

    alt sync final response
      BedrockInstrumentor->>PromptCacheHandler: prompt_caching_converse_handling(response, vendor, model, metric_params)
    else streaming final message
      BedrockInstrumentor->>PromptCacheHandler: prompt_caching_converse_handling(metadata, vendor, model, metric_params)
    end

    rect #EEF8FF
      Note over PromptCacheHandler: extract cache_read/cache_creation\nset span attrs and gen_ai.usage.* token attrs
      PromptCacheHandler->>Span: set CACHED ("read"/"write") and token attributes
    end

    alt cache_read > 0
      PromptCacheHandler->>Metrics: emit cache read metric
    end
    alt cache_creation > 0
      PromptCacheHandler->>Metrics: emit cache write metric
    end

    BedrockInstrumentor-->>Client: return response / stream end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify response/metadata key paths (e.g., usage_metadata, input_token_details) match Bedrock Converse shapes.
  • Confirm span existence and is_recording checks are correct for streaming final-message handling.
  • Ensure handler is only invoked for final events (not intermediate stream events).
  • Review the duplicated test definition in test_anthropic.py and cassette expectations.

Suggested reviewers

  • galkleinman

Poem

🐰 I hopped through responses, counted tokens by light,
read and wrote some markers, and set spans just right.
Metrics took note as the cache did its dance,
traces wore small badges from a fortuitous chance.
Hooray — Converse caching tracked in a joyful prance!

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
  • Out of Scope Changes Check (⚠️ Warning)
    Explanation: The changes to test_anthropic.py appear to be out of scope for a PR focused on "Add prompt caching support for Converse API" (Bedrock); while they introduce caching test utilities and a new caching test for Anthropic models, they do not directly align with the Bedrock Converse objectives stated in issue #3337. More critically, the summary indicates that test_anthropic_converse_with_caching is defined twice (a duplicate function definition), which is a clear code error that should not be present in a merged changeset regardless of scope considerations.
    Resolution: Remove the duplicate test_anthropic_converse_with_caching function definition from test_anthropic.py. Additionally, clarify whether the Anthropic test changes are intentional and necessary for this PR; if they represent a separate feature effort, consider moving them to a separate pull request focused on Anthropic caching to keep this PR focused on its stated Bedrock Converse objectives.
  • Docstring Coverage (⚠️ Warning)
    Explanation: Docstring coverage is 0.00%, which is below the required threshold of 80.00%.
    Resolution: You can run @coderabbitai generate docstrings to improve docstring coverage.
  • Linked Issues Check (❓ Inconclusive)
    Explanation: The code implementation addresses the core requirements from issue #3337 by creating the prompt_caching_converse_handling function that extracts cache metrics from the Converse response body, integrating it into both _handle_converse and _handle_converse_stream, recording cache-related metrics and span attributes, and emitting telemetry to the gen_ai usage attributes. However, the linked requirement to add and validate tests covering both cache-write and cache-read flows is not fully satisfied: the test cassette for test_titan_converse_with_caching.yaml contains a 403 authentication error response rather than a successful caching scenario, and reviewers have reported botocore ValidationExceptions indicating the test input contains an extraneous cacheControl key that is not permitted by the API, preventing successful validation of the implementation.
    Resolution: The implementation should be validated by running the tests with valid AWS credentials so the VCR cassettes can be recorded against successful caching scenarios. Additionally, the test input should be corrected to remove or properly format the cacheControl parameter to resolve the botocore ValidationException before the PR can be fully verified against the issue requirements.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check (✅ Passed): The pull request title "fix(bedrock): Add prompt caching support for Converse API" accurately and concisely describes the main objective of the changeset. The primary changes across the codebase (adding the prompt_caching_converse_handling function, integrating it into _handle_converse and _handle_converse_stream, and introducing associated tests) all directly align with this stated goal. The title is specific and clear without unnecessary noise, making it immediately obvious to reviewers what aspect of the Bedrock instrumentation is being enhanced.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 771b5d5971eaa1e84b528e927bc1695320f9416a and ea2b0dac8f13b8724e3cfcd2c5caedde3b2c3660.

📒 Files selected for processing (1)
  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py:
  • Store API keys only in environment variables/secure vaults; never hardcode secrets in code
  • Use Flake8 for code linting and adhere to its rules

Files:

  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py
🧬 Code graph analysis (1)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (5)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (1)
  • PromptCaching (470-472)
packages/opentelemetry-instrumentation-milvus/tests/conftest.py (1)
  • reader (37-41)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1)
  • CacheSpanAttrs (9-11)
packages/opentelemetry-instrumentation-bedrock/tests/conftest.py (2)
  • instrument_legacy (90-102)
  • brt (36-42)
packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (1)
  • get_finished_spans (40-43)
🪛 Ruff (0.14.2)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py

  • 22-22: Create your own exception (TRY002)
  • 22-22: Avoid specifying long messages outside the exception class (TRY003)
  • 1113-1113: Unused function argument: instrument_legacy (ARG001)

🔇 Additional comments (2)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (2)

16-44: LGTM! Clear helper functions for metric validation.

The get_metric and assert_metric helper functions are well-structured for validating prompt caching metrics in tests. The logic correctly distinguishes between cache read and write operations and verifies the token counts.
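As a rough illustration of helpers with this shape (not the PR's code; the cache-state attribute key is an assumption, and the generic Exception mirrors the Ruff TRY002/TRY003 findings noted above):

```python
# Sketch only: the "gen_ai.cache.state" attribute key is an assumption; the metric
# name gen_ai.prompt.caching comes from the PR description.
def get_metric(resource_metrics, name):
    # Find a metric by name in the collected resource metrics.
    for rm in resource_metrics:
        for scope_metrics in rm.scope_metrics:
            for metric in scope_metrics.metrics:
                if metric.name == name:
                    return metric
    raise Exception(f"No metric found with name {name}")


def assert_metric(reader, expected_read_tokens, expected_write_tokens):
    metrics_data = reader.get_metrics_data()
    metric = get_metric(metrics_data.resource_metrics, "gen_ai.prompt.caching")

    read_tokens = write_tokens = 0
    for point in metric.data.data_points:
        state = point.attributes.get("gen_ai.cache.state")
        if state == "read":
            read_tokens += point.value
        elif state == "write":
            write_tokens += point.value

    assert read_tokens == expected_read_tokens
    assert write_tokens == expected_write_tokens
```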


1112-1171: Test implementation looks correct; VCR cassette recording pending.

The test logic is sound and all critical issues from previous reviews have been addressed:

  • ✓ Uses cachePoint blocks in message content (correct API usage)
  • ✓ Uses GenAIAttributes.GEN_AI_REQUEST_MODEL (correct attribute)
  • ✓ Proper indentation and formatting
  • ✓ Correct cache point type "default"

The test validates both cache-write and cache-read flows with appropriate assertions on usage tokens, span attributes, and metrics. According to the PR objectives, the VCR cassette (test_anthropic_converse_with_caching.yaml) still needs to be recorded by a reviewer with AWS credentials.
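A hedged sketch of the assertion pattern described here, assuming an in-memory span exporter fixture and illustrative attribute keys rather than the test's exact constants:

```python
# Sketch only: the fixture name and attribute keys below are assumptions for illustration.
spans = span_exporter.get_finished_spans()
assert len(spans) == 2  # first converse call writes the cache, second call reads it

write_span, read_span = spans
assert write_span.attributes.get("gen_ai.prompt.caching") == "write"
assert write_span.attributes.get("gen_ai.usage.cache_creation_input_tokens", 0) > 0
assert read_span.attributes.get("gen_ai.prompt.caching") == "read"
assert read_span.attributes.get("gen_ai.usage.cache_read_input_tokens", 0) > 0
```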

Note: The static analysis warning about unused instrument_legacy is a false positive—it's a pytest fixture that sets up instrumentation.



coderabbitai[bot] avatar Oct 29 '25 11:10 coderabbitai[bot]

Hi @nirga , I've updated the PR with the changes we discussed. I moved the logic into a new test in test_titan.py and deleted the old file.

It should be ready for the cassette to be recorded now. Thanks again for your help!

AlanPonnachan avatar Oct 29 '25 16:10 AlanPonnachan

Thanks @AlanPonnachan! Can you record the test? poetry run pytest --record-mode=once

nirga avatar Oct 30 '25 14:10 nirga

Hi @nirga, thank you again for all your guidance.

I tried to run the recording command locally as you requested. As expected, since I don't have AWS credentials, the test run fails with an UnrecognizedClientException (invalid security token). This confirms that the test is now correctly set up and is just waiting for a real recording to be generated.

I believe the PR is now ready from a code perspective. Would you be able to run the recording on your end when you have a moment?

Thank you so much for your help

AlanPonnachan avatar Oct 30 '25 17:10 AlanPonnachan

Thanks @AlanPonnachan, I tried running it locally and the test failed:

FAILED tests/traces/test_titan.py::test_titan_converse_with_caching - botocore.errorfactory.ValidationException: An error occurred (ValidationException) when calling the Converse operation: The model returned the following errors: Malformed input request: extraneous key [cacheControl] is not permitted, please reformat your input and try again.

nirga avatar Oct 30 '25 19:10 nirga

Hi @nirga,

Thank you so much for running the test and providing that error!

I did some research and found that, in general, newer Amazon Titan models do support prompt caching. However, you are right that the specific model in the original test, amazon.titan-text-express-v1, does not support it, which perfectly explains the error.

To ensure the test is stable and reliable, I have implemented the fix we discussed:

  1. I reverted all changes to test_titan.py to ensure it remains focused on Guardrails.
  2. I moved the caching test logic into a new function, test_anthropic_converse_with_caching, within the correct test_anthropic.py file.
  3. The new test uses a supported Anthropic Claude 3 model and the correct API structure (a rough sketch of such a call follows below).
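For reference, a hedged sketch of what a Converse call using cachePoint content blocks might look like; the model ID, prompt, and usage key names are illustrative assumptions rather than the exact test contents:

```python
import boto3

# Sketch only: model ID, prompt, and usage key names are assumptions for illustration.
brt = boto3.client("bedrock-runtime", region_name="us-east-1")

# The prompt must exceed the model's minimum cacheable token count to trigger caching.
long_context = "The quick brown fox jumps over the lazy dog. " * 200

messages = [
    {
        "role": "user",
        "content": [
            {"text": long_context},
            {"cachePoint": {"type": "default"}},  # everything above this block is eligible for caching
            {"text": "Summarize the text above in one sentence."},
        ],
    }
]

# The first call is expected to write the cache; an identical second call should read from it.
first = brt.converse(modelId="anthropic.claude-3-5-sonnet-20240620-v1:0", messages=messages)
second = brt.converse(modelId="anthropic.claude-3-5-sonnet-20240620-v1:0", messages=messages)

print(first["usage"].get("cacheWriteInputTokens"), second["usage"].get("cacheReadInputTokens"))
```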

Would you be able to run the recording on your end when you have a moment?

AlanPonnachan avatar Nov 02 '25 17:11 AlanPonnachan