Roy Belio
on it
https://github.com/Blaizzy/mlx-vlm/pull/268/
A few notes here: at least originally, this was intentional: test_inference_client_caching.py:128-139 explicitly verifies that clients are NOT cached, because API keys can come from different users per-request via the...
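A minimal sketch of the caching concern, assuming an illustrative `InferenceClient` and helper names that are not the actual llama-stack code: an unkeyed memoized client would leak one user's API key into another request, whereas per-request construction (what the test verifies) or keying any future cache on the API key preserves isolation.

```python
from functools import lru_cache


class InferenceClient:
    """Illustrative stand-in for a provider client built around an API key."""

    def __init__(self, api_key: str):
        self.api_key = api_key


def get_client_per_request(api_key: str) -> InferenceClient:
    # Current, intentional behavior: a fresh client per request, so a
    # per-request API key never leaks across users.
    return InferenceClient(api_key)


@lru_cache(maxsize=None)
def get_client_keyed_cache(api_key: str) -> InferenceClient:
    # If caching were ever reintroduced, keying the cache on the API key
    # would keep the per-user isolation the test asserts.
    return InferenceClient(api_key)


assert get_client_per_request("key-a") is not get_client_per_request("key-a")
assert get_client_keyed_cache("key-a") is get_client_keyed_cache("key-a")
assert get_client_keyed_cache("key-a") is not get_client_keyed_cache("key-b")
```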
@franciscojavierarceo have this and #4021 been addressed in any way? I'd like to take on the effort
/assign This is a legitimate bug that creates confusion for users. The bug exists in two locations: vector_store.py:144 - The `content_from_doc()` function uses a regex pattern that matches file://: `pattern...
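A hedged illustration of the kind of scheme-matching regex described above (the actual `pattern` at vector_store.py:144 is truncated here, so the one below is assumed): a pattern that also matches `file://` makes local paths look like fetchable URLs.

```python
import re

# Assumed pattern for illustration only; it matches file:// in addition to http(s)://.
URL_PATTERN = re.compile(r"^(https?|file)://")


def looks_like_fetchable_url(content: str) -> bool:
    """Returns True for file:// URIs too, which is the source of the confusion."""
    return bool(URL_PATTERN.match(content))


print(looks_like_fetchable_url("file:///tmp/notes.txt"))       # True, but it is a local path
print(looks_like_fetchable_url("https://example.com/doc"))     # True
print(looks_like_fetchable_url("plain inline document text"))  # False
```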
The bug is at [streaming.py:573](https://github.com/llamastack/llama-stack/blob/aac494c5baca31fca434c197e65567f1ee8672b2/src/llama_stack/providers/inline/agents/meta_reference/responses/streaming.py#L573) where `chunk_finish_reason = ""` is initialized. When streaming chunks don't provide a finish_reason (which may be the case with Llama providers), this empty string fails OpenAI...
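A minimal sketch of the failure mode and one possible guard, not the actual streaming.py code; the choice of `"stop"` as a fallback is an assumption:

```python
from typing import Iterable, Optional


def resolve_finish_reason(chunk_reasons: Iterable[Optional[str]]) -> str:
    """Collapse per-chunk finish_reasons, falling back to 'stop' if none arrived.

    Mirrors the pattern where chunk_finish_reason starts as "" and is only
    overwritten when a chunk actually carries a finish_reason.
    """
    finish_reason = ""
    for reason in chunk_reasons:
        if reason:
            finish_reason = reason
    return finish_reason or "stop"  # assumption: 'stop' is an acceptable default


print(resolve_finish_reason([None, None]))      # 'stop' instead of ''
print(resolve_finish_reason([None, "length"]))  # 'length'
```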
@ashwinb this error still reproduces; can I take it? It looks like an easy fix in server.py: replacing `await event_gen.aclose()` (which doesn't exist for `AsyncStream` in the openai with...
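A hedged sketch of one way the cleanup could tolerate both shapes; `close_event_stream` is a hypothetical helper, not the actual server.py change:

```python
import inspect


async def close_event_stream(event_gen) -> None:
    """Close either an async generator (aclose) or a stream object (close)."""
    aclose = getattr(event_gen, "aclose", None)
    if aclose is not None:
        await aclose()
        return
    close = getattr(event_gen, "close", None)
    if close is not None:
        result = close()
        if inspect.isawaitable(result):  # close() may be sync or async
            await result
```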
In addition to the fix itself, as part of the PR the user can provide the following as an argument: `{"tags": "tag0,tag1"}`, to be split again later by the user with `output.split(',')`.
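A tiny round-trip example of the comma-separated tags workaround described above (assumed shapes, not the actual API):

```python
tags = ["tag0", "tag1"]

request_args = {"tags": ",".join(tags)}  # what the caller passes in
output = request_args["tags"]            # what comes back, as a single string
print(output.split(","))                 # ['tag0', 'tag1']
```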
When running in test mode with Gunicorn:
- Multiple worker processes are spawned
- Each worker has separate telemetry instrumentation
- The mock OTLP collector can't capture spans from all workers
- Tests expect...
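A hedged sketch of one way tests could sidestep this, by forcing a single worker so all spans land in one process; `GUNICORN_CMD_ARGS` is a standard Gunicorn environment variable, while the helper name and the `myapp:app` module path are hypothetical:

```python
import os
import subprocess


def run_server_single_worker() -> subprocess.Popen:
    """Start the server under test with a single Gunicorn worker (hypothetical helper)."""
    env = dict(os.environ)
    env["GUNICORN_CMD_ARGS"] = "--workers 1"  # keep all spans in a single process
    return subprocess.Popen(["gunicorn", "myapp:app"], env=env)
```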
> This looks good, my only comment would be to fail the server start if the metadata store is SQLite AND Gunicorn is used. If we have multiple workers...
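A hedged sketch of the suggested startup guard; the function name and arguments are hypothetical stand-ins, not the actual llama-stack configuration API:

```python
def validate_metadata_store(store_type: str, worker_count: int) -> None:
    """Refuse to start when a SQLite metadata store would be shared by multiple workers."""
    if store_type.lower() == "sqlite" and worker_count > 1:
        raise RuntimeError(
            "SQLite metadata store is not safe with multiple Gunicorn workers; "
            "run a single worker or switch to a server-backed store."
        )


validate_metadata_store("postgres", 4)  # ok
validate_metadata_store("sqlite", 1)    # ok
# validate_metadata_store("sqlite", 4)  # would raise RuntimeError at server start
```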