
Fix AttributeError in streaming response cleanup

Open r-bit-rry opened this issue 1 month ago • 7 comments

This PR fixes issue #3185. When clients cancel streaming requests, the server tries to clean up with:

await event_gen.aclose()  # ❌ AsyncStream doesn't have aclose()!

But OpenAI's AsyncStream has never had a public aclose() method; it has an async close(). The error message literally tells us:

AttributeError: 'AsyncStream' object has no attribute 'aclose'. Did you mean: 'close'?
                                                                            ^^^^^^^^
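
For context, a minimal sketch of the shape of the fix (the helper name is hypothetical, and the hasattr check is the pattern discussed in the comments below rather than the final design):

# Hypothetical helper; sketch only, not the exact patch in this PR.
async def close_event_stream(event_gen) -> None:
    if hasattr(event_gen, "aclose"):  # plain async generators (PEP 525)
        await event_gen.aclose()
    elif hasattr(event_gen, "close"):  # OpenAI's AsyncStream has an async close()
        await event_gen.close()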

Verification

  • Reproduction script reproduce_issue_3185.sh can be used to verify the fix.
  • Manual checks and validation against the original OpenAI library code

r-bit-rry · Nov 26 '25 08:11

@r-bit-rry a chain of hasattr like this suggests we've done something wrong in the design. have we or can we just call close?

It really comes down to what we want to support. Since this was never strictly typed, I'm assuming there are other objects that can be generated by the sse_generator. I'd go a step further and ask why we even assume a close method exists.

And on a more serious note, @mattf: PEP 525 defines aclose() for async generators. The OpenAI and Anthropic SDKs deviate from that and use close() instead (while under the hood they both call httpx.Response.aclose()). We cannot control third-party SDK design choices,

so it's our decision whether we want to enforce certain typings and act on them, or let this pattern "catch all".
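
To illustrate the difference (a minimal, self-contained example, not code from the PR):

import asyncio

async def numbers():
    yield 1
    yield 2

async def main():
    gen = numbers()
    print(await gen.__anext__())  # 1
    await gen.aclose()  # PEP 525: async generators expose aclose()
    # OpenAI's AsyncStream instead exposes an async close(), which wraps
    # httpx.Response.aclose() under the hood.

asyncio.run(main())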

r-bit-rry · Nov 27 '25 06:11

@mattf sure thing, I'll start working on those

r-bit-rry · Nov 27 '25 12:11

@mattf As I see it, we have two options for avoiding a hasattr chain when handling the AsyncStream object. Option 1: wrapping

async def wrap_async_stream(stream):
    # Re-yield each chunk so callers get a plain async generator, which has aclose().
    async for item in stream:
        yield item

This is explicit and simple, but it carries a small per-chunk overhead.
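
A sketch of how option 1 would be used at a call site (the function name and params are stand-ins; the real call sites are listed further down):

# Hypothetical call site: wrap the SDK stream so callers get a real async
# generator, which supports aclose() during cancellation cleanup.
async def stream_chat_completion(client, params):
    response = await client.chat.completions.create(**params, stream=True)
    return wrap_async_stream(response)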

Option 2: adapter pattern

class AsyncStreamAdapter(AsyncIterator[T]):
    def __init__(self, stream):
        self._stream = stream

    async def aclose(self):
        await self._stream.close()  # delegate aclose() -> close()

Direct delegation, with no re-yielding and more explicit intent.
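
For completeness, a fuller sketch of what option 2 would need in order to actually be iterable (illustrative only, not code from the PR):

from typing import AsyncIterator, TypeVar

T = TypeVar("T")

class AsyncStreamAdapter(AsyncIterator[T]):
    """Wraps an SDK stream that exposes close() and presents aclose() instead."""

    def __init__(self, stream):
        self._stream = stream

    def __aiter__(self) -> "AsyncStreamAdapter[T]":
        return self

    async def __anext__(self) -> T:
        return await self._stream.__anext__()

    async def aclose(self) -> None:
        await self._stream.close()  # delegate the PEP 525 name to the SDK's close()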

Regarding the locations of the violations, where we will need patching, these are the places I was able to spot:

  1. _maybe_overwrite_id() in src/llama_stack/providers/utils/inference/openai_mixin.py:251: when overwrite_completion_id=False and stream=True, it returned an AsyncStream (has close()) instead of an AsyncIterator (has aclose()).

  2. PassthroughInferenceAdapter.openai_chat_completion() in src/llama_stack/providers/remote/inference/passthrough/passthrough.py:123-136: returned the raw client.chat.completions.create() response.

  3. PassthroughInferenceAdapter.openai_completion() in src/llama_stack/providers/remote/inference/passthrough/passthrough.py:108-121: returned the raw client.completions.create() response.

  4. LiteLLMOpenAIMixin.openai_chat_completion() in src/llama_stack/providers/utils/inference/litellm_openai_mixin.py:222-278: returned the raw litellm.acompletion() result.

  5. LiteLLMOpenAIMixin.openai_completion() in src/llama_stack/providers/utils/inference/litellm_openai_mixin.py:179-220: returned the raw litellm.atext_completion() result.

r-bit-rry · Nov 27 '25 15:11

@r-bit-rry great finds! it looks like we're violating the api contract and using # ignore to cover it up...resulting in the bug. what fix do you suggest?

mattf · Nov 27 '25 16:11

@mattf I suggest using the first option, with direct declaration of the types, and changing the API contract for openai_completion, removing the # type: ignore comments. Since I'm already touching all the places where the return value of the method is handled, I think fixing the contract there is the most responsible way of handling this.

r-bit-rry · Nov 30 '25 08:11

@r-bit-rry sounds good!

mattf · Nov 30 '25 10:11

@mattf As I'm digging, I'm finding new stuff: apparently we have a discrepancy between the official and internal documentation of openai_completion and the implementation in place. The request schema from inference.py:871:

class OpenAICompletionRequestWithExtraBody(BaseModel, extra="allow"):
    # ...
    stream: bool | None = None  # ← streaming IS a parameter!

However, the documentation did not seem to support streaming for it to begin with, neither on our end nor on OpenAI's official end (docs/docs/contributing/new_api_provider.mdx:40-43):

  • openai_completion(): Legacy text completion API with full parameter support
  • openai_chat_completion(): Chat completion API supporting streaming, tools, and function calling

Checking the OpenAI documentation, though, the CreateCompletionResponse schema says: "Represents a completion response from the API. Note: both the streamed and non-streamed response objects share the same shape (unlike the chat endpoint)." So only our documentation is lacking. I've incorporated this into the changes I've committed.
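
Given that note, a sketch of what the adjusted openai_completion contract could look like (the protocol name is illustrative; the request and response types are the Llama Stack models referenced in this thread, and the exact signature in the PR may differ):

from collections.abc import AsyncIterator
from typing import Protocol

# Sketch of the contract only: streamed and non-streamed legacy completions share
# a shape, so the streaming branch can be typed as an async iterator of the same model.
class InferenceProviderSketch(Protocol):
    async def openai_completion(
        self, params: "OpenAICompletionRequestWithExtraBody"
    ) -> "OpenAICompletion | AsyncIterator[OpenAICompletion]": ...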

r-bit-rry · Nov 30 '25 15:11

can we get rid of the type ignores now?

Remaining type: ignore comments are only for external library issues:

  • type: ignore[arg-type]: LiteLLM streaming types don't match OpenAI
  • type: ignore[return-value]: external libs lack type stubs

what about passthrough?

mattf · Dec 01 '25 15:12

Regarding the passthrough type: ignore comments, those are actually for the non-streaming path, not the streaming fix.

The issue is just a type mismatch between what the OpenAI SDK returns (Completion/ChatCompletion) and what Llama Stack types expect (OpenAICompletion/OpenAIChatCompletion). Structurally they're the same, but as far as I understand mypy doesn't see them as identical.

The streaming path (wrap_async_stream(response)) doesn't have any type ignores and doesn't need one since wrap_async_stream properly returns AsyncIterator[T].

So removing these would need either explicit type conversions or aligning the Llama Stack type definitions with the SDK types directly; I think that is out of scope for this fix. Happy to tackle it separately if you think it's worth it, though.
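
For reference, one shape the explicit-conversion route could take (a sketch that assumes both sides are Pydantic models, which is what the structural match above suggests; the import path is assumed):

# Assumed import path; OpenAICompletion is the Llama Stack model discussed above.
from llama_stack.apis.inference import OpenAICompletion

def to_stack_completion(sdk_completion) -> OpenAICompletion:
    # Round-trip through a dict so mypy sees the declared return type
    # instead of needing a type: ignore on the raw SDK object.
    return OpenAICompletion.model_validate(sdk_completion.model_dump())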

r-bit-rry · Dec 07 '25 14:12

i see. pls file an issue about that so we can resolve it later.

mattf · Dec 09 '25 15:12

@mattf I created #4369 per your request.

r-bit-rry · Dec 11 '25 09:12

✱ Stainless preview builds

This PR will update the llama-stack-client SDKs with the following commit message.

fix(inference): AttributeError in streaming response cleanup
⚠️ llama-stack-client-node studio · code

There was a regression in your SDK. generate ⚠️ · build ✅ · lint ✅ · test ✅

npm install https://pkg.stainless.com/s/llama-stack-client-node/b634edc8ffeed4f76fd9bd0eeedafd0f441a9dad/dist.tar.gz
⚠️ llama-stack-client-kotlin studio · code

There was a regression in your SDK. generate ⚠️ · lint ✅ · test ❗

⚠️ llama-stack-client-python studio · conflict

There was a regression in your SDK.

⚠️ llama-stack-client-go studio · code

There was a regression in your SDK. generate ❗ · lint ❗ · test ❗

go get github.com/stainless-sdks/llama-stack-client-go@703cd5b2f741deeb209cc8d1a039265ea58676e0


github-actions[bot] · Dec 11 '25 09:12