
reduce overhead of creating an OpenAI Client for each inference request

Open mattf opened this issue 1 month ago

🚀 Describe the new functionality needed

every inference request creates a new OpenAI Client -

$ python -m timeit -s "from openai import Client" \                    
  "Client(api_key='none', base_url='http://localhost/')"                   
20 loops, best of 5: 11.2 msec per loop

this limits the number of requests that can be handled per second: roughly 90 rps on the test system (1 / 11.2 ms ≈ 89).

💡 Why is this needed? What if we don't build it?

inference throughput will stay capped at roughly 90 rps on the test system by client construction overhead

Other thoughts

the majority of the time is spent constructing the SSL context, which touches disk to load the CA bundle -

$ python -m timeit -s "import ssl" \
  "ssl.create_default_context()"
20 loops, best of 5: 11.3 msec per loop

suggested solution: reuse the ssl context

$ python -m timeit -s "from openai import Client; import ssl, httpx; \
      shared_ctx = ssl.create_default_context()" \
  "Client(api_key='none', base_url='http://localhost/', http_client=httpx.Client(verify=shared_ctx))"
2000 loops, best of 5: 113 usec per loop

this allows for roughly 9,000 rps on the test system.

risk: ssl context changes will not be detected until server restart
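
for illustration, a minimal sketch of what this could look like in provider code (the _SHARED_SSL_CTX name and the make_client helper are hypothetical, not existing llama-stack code):

```python
import ssl

import httpx
from openai import Client

# built once at import/startup; the disk I/O (loading the CA bundle)
# happens only here instead of on every request
_SHARED_SSL_CTX = ssl.create_default_context()


def make_client(api_key: str, base_url: str) -> Client:
    # a fresh OpenAI client and a fresh httpx.Client per request,
    # reusing only the expensive-to-build SSL context
    return Client(
        api_key=api_key,
        base_url=base_url,
        http_client=httpx.Client(verify=_SHARED_SSL_CTX),
    )
```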

httpx.Client has additional per-user state (cookie jar, default headers, auth) that should not be shared between users, so please do not do this -

$ python -m timeit -s "from openai import Client; import ssl, httpx; \
      shared_http = httpx.Client(verify=ssl.create_default_context())" \
  "Client(api_key='none', base_url='http://localhost/', http_client=shared_http)"
10000 loops, best of 5: 27.7 usec per loop
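
to illustrate the kind of state that leaks, a hedged sketch (the cookie is simulated and the host is hypothetical):

```python
import httpx

# one httpx.Client shared across users, as in the anti-pattern above
shared_http = httpx.Client()

# simulate a session cookie that a server response set for user A
# (normally this happens automatically via a Set-Cookie header)
shared_http.cookies.set("session", "user-a-secret", domain="example.com")

# a request built for user B through the same client now carries user A's cookie
req = shared_http.build_request("GET", "https://api.example.com/v1/models")
print(req.headers.get("cookie"))  # session=user-a-secret
```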

mattf avatar Nov 24 '25 15:11 mattf

> risk: ssl context changes will not be detected until server restart

can you explain what causes the SSL context to change?

ashwinb avatar Dec 01 '25 23:12 ashwinb

A few notes here. At least originally, this behavior was intentional: test_inference_client_caching.py:128-139 explicitly verifies that clients are NOT cached, because API keys can come from different users per-request via the x-llamastack-provider-data header.

def test_openai_provider_data_used(config_cls, adapter_cls, provider_data_validator: str, config_params: dict):
    """Ensure the OpenAI provider does not cache api keys across client requests"""

Sharing the SSL context is acceptable. The remaining risk is SSL context staleness: certificate revocations and CA updates won't be detected until server restart. This is acceptable for most deployments, since certificate changes are typically planned events that require a deployment anyway.
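
To make the compatibility concrete, a small sketch (illustrative names, not project code): per-request API keys still produce distinct clients when only the SSL context is shared, so the no-caching behavior the test verifies is preserved:

```python
import ssl

import httpx
from openai import Client

shared_ctx = ssl.create_default_context()


def client_for(api_key: str) -> Client:
    # a distinct client per request/user; only the SSL context is reused
    return Client(api_key=api_key, base_url="http://localhost/",
                  http_client=httpx.Client(verify=shared_ctx))


# keys from x-llamastack-provider-data remain per-user
assert client_for("key-a").api_key != client_for("key-b").api_key
```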

I'm +1 on this, and think it's an easy win for the project as long as we avoid sharing the entire httpx.Client object.

r-bit-rry avatar Dec 03 '25 08:12 r-bit-rry

@r-bit-rry 💯

mattf avatar Dec 03 '25 12:12 mattf