
[bug] Sequential Processing due to Connection Pool Limits

Open justinthelaw opened this issue 2 months ago • 14 comments

Connection Pool Limits Cause Sequential Processing Instead of Concurrent Execution

Summary

BAML appears to have connection pool limits that cause high-concurrency requests to be processed sequentially rather than concurrently, despite correct usage of asyncio.gather(). This manifests as a distinctive timing pattern where requests complete in sequential batches rather than truly in parallel.

Environment

  • BAML Version: 0.208.5 (latest as of issue creation: 0.211.0)
  • Python Version: 3.12.5
  • OS: macOS
  • Usage Pattern: 20+ concurrent requests via asyncio.gather()

Issue Details

Expected Behavior

When making multiple concurrent BAML calls with asyncio.gather(), requests should execute in parallel with completion times distributed based on actual API response times.

Actual Behavior

Requests are processed in sequential batches (~6 at a time), creating this pattern:

  1. First ~6 requests: Complete sequentially with 1.5-2s gaps between each
  2. Sudden burst: 6+ requests complete within milliseconds of each other
  3. Pattern repeats: Indicating connection pool cycling rather than true concurrency

Evidence from Production Logs

Sequential Processing Phase:

2025-10-08 17:22:26,888 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-08 17:22:28,519 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 1.631s]
2025-10-08 17:22:30,228 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 1.709s] 
2025-10-08 17:22:31,689 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 1.461s]
2025-10-08 17:22:33,466 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 1.777s]
2025-10-08 17:22:35,298 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 1.832s]

Then Sudden Concurrent Burst:

2025-10-08 17:22:47,930 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-08 17:22:47,930 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 0ms]
2025-10-08 17:22:47,931 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 1ms]
2025-10-08 17:22:47,931 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 0ms]
2025-10-08 17:22:47,932 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 1ms]
2025-10-08 17:22:47,933 INFO httpx HTTP Request: POST http://localhost:8080/v1/chat/completions "HTTP/1.1 200 OK"  [Gap: 1ms]

User Code (Correctly Implemented)

async def concurrent_simplified_generation(queries, context_chunks_list, baml_options):
    """From backend/backend/core/agents/helpers.py - correctly uses asyncio.gather"""
    tasks = []
    for query, context_chunks in zip(queries, context_chunks_list, strict=True):
        task = simplified_baml_qa_response(query, ..., baml_options=baml_options)
        tasks.append(task)
    
    # This should enable true concurrency, but BAML appears to serialize internally
    return await asyncio.gather(*tasks)
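
For reference, here is a minimal timing sketch (run_one is a hypothetical stand-in for the real BAML call; the sleep just simulates one LLM round trip) that makes the pattern visible: under true concurrency all completion offsets cluster near the single-request latency, while sequential batching produces a staircase.

import asyncio
import time

async def run_one(i: int) -> float:
    """Hypothetical stand-in for simplified_baml_qa_response(...)."""
    await asyncio.sleep(1.5)  # simulate one LLM round trip
    return time.monotonic()

async def main() -> None:
    start = time.monotonic()
    finish_times = await asyncio.gather(*(run_one(i) for i in range(20)))
    # Concurrent: offsets cluster near 1.5s. Serialized: they climb in ~1.5s steps.
    print(sorted(round(t - start, 2) for t in finish_times))

asyncio.run(main())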

Relationship to Previous Work

Acknowledgment: The BAML team has already addressed several connection pool issues:

  • PR #1027/#1028: Fixed idle connection stalling in FFI boundaries
  • PR #2205: Fixed file descriptor leaks with pool timeouts

This issue is different:

  • Previous fixes addressed idle connections and resource leaks
  • This issue is about active connection limits preventing true concurrency
  • The distinctive timing pattern suggests connection pool exhaustion rather than idle timeouts

Root Cause Analysis

BAML appears to use httpx (or a similar HTTP client) internally with connection pool limits that aren't suitable for high-concurrency scenarios. The current configuration likely allows ~6 concurrent connections, causing additional requests to queue rather than execute in parallel.
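
To illustrate the hypothesized mechanism (this is only a sketch of how a capped pool behaves, not BAML's actual internals), an httpx client limited to 6 connections serializes the excess requests and reproduces the stepped completion pattern:

import asyncio
import time

import httpx

async def main() -> None:
    # Cap the pool at 6 connections: requests 7..N queue until a connection
    # frees up, producing completion times that climb in batches.
    limits = httpx.Limits(max_connections=6)
    async with httpx.AsyncClient(limits=limits, timeout=30.0) as client:
        start = time.monotonic()

        async def call() -> float:
            # Placeholder URL; any endpoint with non-trivial latency works.
            await client.get("http://localhost:8080/slow")
            return time.monotonic() - start

        offsets = await asyncio.gather(*(call() for _ in range(20)))
        print(sorted(round(o, 2) for o in offsets))

asyncio.run(main())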

Impact

  • Performance degradation: 20 concurrent requests that should complete in ~3-5s take 30-50s
  • Poor resource utilization: CPU and network remain idle while requests queue
  • Unpredictable latency: Request completion depends on queue position, not actual processing

Proposed Solutions

  1. Expose connection pool configuration in BAML client options (a hypothetical sketch follows this list)
  2. Increase default connection limits for modern high-concurrency use cases
  3. Add configuration similar to the existing timeout proposal in #1630
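
For the first proposal, a purely hypothetical sketch of what such a knob could look like when registering a client dynamically; the commented-out option names do not exist in BAML today, while everything else mirrors the existing ClientRegistry API:

from baml_py import ClientRegistry

cr = ClientRegistry()
cr.add_llm_client(
    name='StructuredLlm',
    provider='openai',
    options={
        'base_url': 'http://localhost:8080/v1',  # placeholder endpoint
        'api_key': 'sk-placeholder',             # placeholder key
        'model': 'gpt-4o',
        # Hypothetical, currently unsupported knobs (names illustrative only):
        # 'max_connections': 64,
        # 'pool_timeout_ms': 5000,
    },
)
cr.set_primary('StructuredLlm')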

Additional Context

  • Issue becomes pronounced with 10+ concurrent requests
  • Observed on BAML 0.208.5; a review of releases through 0.211.0 shows no related fixes
  • This significantly impacts batch processing and parallel generation workflows
  • Related to #1630 (configurable timeouts) but specifically about connection limits

Reproducible: Yes, consistently observed across multiple test runs and production usage

justinthelaw avatar Oct 08 '25 21:10 justinthelaw

BAML-515

linear[bot] avatar Oct 08 '25 21:10 linear[bot]

Thanks for sharing this bug @justinthelaw with a very detailed repro! We should be able to patch this.

hellovai avatar Oct 08 '25 23:10 hellovai

Do you use the baml async client or sync client?

aaronvg avatar Oct 10 '25 20:10 aaronvg

@aaronvg We use the BAML async client (BamlAsyncClient).

justinthelaw avatar Oct 10 '25 21:10 justinthelaw

ok we are taking a look now

aaronvg avatar Oct 10 '25 21:10 aaronvg

Just in case I am actually just making a simple async Python code mistake, here is another representative example of our call to the BAML client:

async def concurrent_generation(
    queries,
    context_chunks_list,
    baml_options,
    system_prompt,
    user_prompt_template,
):
    tasks = []
    for query, context_chunks in zip(queries, context_chunks_list, strict=True):
        task = simplified_baml_qa_response(
            query, system_prompt, user_prompt_template, context_chunks, max_tokens=2500, baml_options=baml_options
        )
        tasks.append(task)
    return await asyncio.gather(*tasks)

where simplified_baml_qa_response is,

async def simplified_baml_qa_response(
    query,
    system_prompt,
    user_prompt_template,
    context_chunks,
    max_tokens,
    baml_options,
):
    user_prompt = _build_user_prompt(query, context_chunks, user_prompt_template)
    baml_messages = [{'role': 'user', 'content': user_prompt}]
    return await b.ChainOfThoughtCall(system_prompt.value, baml_messages, baml_options=baml_options)

where b is,

from baml_client.async_client import b

justinthelaw avatar Oct 10 '25 21:10 justinthelaw

perfect, thanks for all the info

aaronvg avatar Oct 10 '25 21:10 aaronvg

I wrote a test to check this (#2605) which spawns a local server with a latency parameter and responds to each request after that delay. In theory, since requests are processed concurrently, sending N requests at once should complete roughly within the time it takes to process a single request, plus some overhead for scheduling the promises.
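
For anyone who wants to try this locally, here is a rough sketch of that kind of setup (not the actual #2605 test): a mock OpenAI-compatible endpoint that sleeps for a fixed latency before answering, so a truly concurrent client should finish a batch of N requests in roughly one latency period. The response body is a bare minimum and may need extra fields depending on the client.

import asyncio
import json

LATENCY_SECONDS = 2.0  # simulated per-request model latency

async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    await reader.read(65536)          # discard the request; this mock doesn't parse it
    await asyncio.sleep(LATENCY_SECONDS)
    body = json.dumps({
        "id": "mock",
        "object": "chat.completion",
        "model": "concurrency-test",
        "choices": [{"index": 0, "finish_reason": "stop",
                     "message": {"role": "assistant", "content": "ok"}}],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }).encode()
    writer.write(
        b"HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n"
        + f"Content-Length: {len(body)}\r\nConnection: close\r\n\r\n".encode()
        + body
    )
    await writer.drain()
    writer.close()

async def main() -> None:
    # Point a client at http://127.0.0.1:9000/v1 and gather N requests:
    # total wall time should be ~LATENCY_SECONDS if they really run concurrently.
    server = await asyncio.start_server(handle, "127.0.0.1", 9000)
    async with server:
        await server.serve_forever()

asyncio.run(main())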

So far I haven't been able to reproduce this bug of sequential batches of 6 requests; they all run concurrently when sent to the local server. However, when sending them to OpenAI I do see some sequential behavior, so I'm wondering if the problem is somewhere else or on their end; this OpenAI dev community post describes a similar issue.

I'll test a couple more things, but I'm not sure this is a bug in the Baml runtime (which, by the way, uses the reqwest crate in Rust, not Python's requests/httpx).

antoniosarosi avatar Oct 12 '25 18:10 antoniosarosi

If you could run that particular test against a local server that returns a mock response, instead of in production, it would be great to see whether the issue is in Baml or somewhere else. You can configure the client using the openai-generic provider with a base URL pointing to the local test server.

antoniosarosi avatar Oct 12 '25 18:10 antoniosarosi

@antoniosarosi oh interesting finds! We actually use AWS Anthropic Bedrock locally (IAM Roles Anywhere) for development, and we also use the standard OpenAI provider when using a custom Mock LLM or local llama.cpp server. I encountered the issue when using the OpenAI provider, so your tests definitely help narrow down where the problem might be.

I haven't run these against the AWS Anthropic Bedrock provider yet, since we usually only use it for performance evaluations (e.g., evals against golden data sets, testing specific agentic behavior), but I can try it later next week.

justinthelaw avatar Oct 12 '25 22:10 justinthelaw

AWS Bedrock does use some custom client wrapper internally:

https://github.com/BoundaryML/baml/blob/af245134b39a48b7513f4510c666fbbc577e77eb/engine/baml-runtime/src/internal/llm_client/primitive/aws/custom_http_client.rs#L38-L44

So I wonder if that could cause the bug, but if it's also present in the standard OpenAI provider, I think the problem is elsewhere. Named clients (those defined as client<llm> Name in Baml) seem to be created only once and then cached (thus reusing the connection pool when making more requests with the same client). However, dynamic clients (those added with ClientRegistry in Python) and shorthand clients (using client "openai/gpt-4o" in Baml functions) are both created once per request:

https://github.com/BoundaryML/baml/blob/af245134b39a48b7513f4510c666fbbc577e77eb/engine/baml-runtime/src/lib.rs#L1613-L1620

Ok, let me try a couple more things; maybe I can find the issue.

antoniosarosi avatar Oct 12 '25 22:10 antoniosarosi

Ok, this is interesting. So far I've been able to reproduce this:

1. Custom client definition with base_url coming from an environment variable.

Concurrent.

client<llm> ConcurrencyTestClient {
  provider openai-generic
  options {
    base_url env.CONCURRENT_SERVER_URL // THIS WORKS
    model "concurrency-test"
    api_key env.OPENAI_API_KEY
  }
}

2. Dynamic client definition with base_url passed in ClientRegistry.

Concurrent.

async def concurrent():
    cr = ClientRegistry()
    cr.add_llm_client("ConcurrencyTestClient", "openai-generic", {
        "model": "concurrency-test",
        "base_url": "http://127.0.0.1:9000/v1",
    })
    cr.set_primary("ConcurrencyTestClient")

    tasks = [b.CallFunction({"client_registry": cr}) for _ in range(0, 20)]
    
    await asyncio.gather(*tasks)

3. Custom client definition with static base_url.

Not concurrent.

client<llm> ConcurrencyTestClient {
  provider openai-generic
  options {
    base_url "http://127.0.0.1:9000/v1" // THIS DOES NOT WORK
    model "concurrency-test"
    api_key env.OPENAI_API_KEY
  }
}

4. Shorthand client.

Not concurrent.

function ConcurrencyTest() -> string {
    client "openai/gpt4-o" // THIS DOES NOT WORK
    prompt "Write a poem"
}

@justinthelaw can you confirm whether you're using either a hardcoded static URL or a shorthand client? When sending requests to the local server, those are the ones killing concurrency.

antoniosarosi avatar Oct 12 '25 23:10 antoniosarosi

My bad, I messed up the parameters of the test server and that's why the tests were failing. Unfortunately, I still cannot reproduce this.

antoniosarosi avatar Oct 13 '25 18:10 antoniosarosi

@antoniosarosi we've been using the dynamic registration option, which registers the appropriate named provider based on environment variables parsed by Pydantic settings.

See the example below:

from baml_py import ClientRegistry

from backend.deps import get_async_inference_service
from backend.settings import settings

service = get_async_inference_service()

cr = ClientRegistry()

if settings.INFERENCE_PROVIDER in ['llamacpp', 'vllm']:
    if settings.OPENAI_ENDPOINT and settings.OPENAI_API_KEY:
        cr.add_llm_client(
            name='StructuredLlm',
            provider='openai',  # to work with BAML this provider must be 'openai'
            options={
                'base_url': settings.OPENAI_ENDPOINT,
                'api_key': settings.OPENAI_API_KEY,
                'model': settings.STRUCTURED_LLM,
                'temperature': 0,
                'max_tokens': service.STRUCTURED_RESPONSE_MAX_TOKENS,
            },
        )
        cr.add_llm_client(
            name='FastLlm',
            provider='openai',  # to work with BAML this provider must be 'openai'
            options={
                'base_url': settings.OPENAI_ENDPOINT,
                'api_key': settings.OPENAI_API_KEY,
                'model': settings.FAST_LLM,
                'temperature': 0,
                'max_tokens': service.STRUCTURED_RESPONSE_MAX_TOKENS,
            },
        )
    else:
        raise ValueError(
            'OPENAI_ENDPOINT and OPENAI_API_KEY must be set when using llamacpp or vllm inference provider'
        )
elif settings.INFERENCE_PROVIDER == 'aws-bedrock':
    cr.add_llm_client(
        name='StructuredLlm',
        provider='aws-bedrock',  # to work with BAML this provider must be 'aws-bedrock'
        options={
            'model': settings.STRUCTURED_LLM,
            'inference_configuration': {
                'temperature': 0,
                'max_tokens': service.STRUCTURED_RESPONSE_MAX_TOKENS,
            },
        },
    )
    cr.add_llm_client(
        name='FastLlm',
        provider='aws-bedrock',  # to work with BAML this provider must be 'aws-bedrock'
        options={
            'model': settings.FAST_LLM,
            'inference_configuration': {
                'temperature': 0,
                'max_tokens': service.STRUCTURED_RESPONSE_MAX_TOKENS,
            },
        },
    )
else:
    raise ValueError(f'Unsupported inference provider: {settings.INFERENCE_PROVIDER}')

We make the call to the provider here:

function ChainOfThoughtCall(system_message: string,
                            messages: map<string, string>[]
                            ) -> BamlChainOfThoughtResponse {
    client StructuredLlm
    prompt #"
        {{ _.role('system') }}
        {{ system_message }}
        {% for message in messages %}
            {{ _.role(message.role) }}
            {{ message.content }}
        {% endfor %}
        {{ ctx.output_format }}
    "#
}
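
For completeness, a minimal sketch (assuming the same shape as the earlier snippets in this thread) of how the registry above is handed to the generated client via baml_options:

baml_options = {'client_registry': cr}

# e.g. inside simplified_baml_qa_response:
response = await b.ChainOfThoughtCall(
    system_prompt.value,
    baml_messages,
    baml_options=baml_options,
)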

justinthelaw avatar Oct 14 '25 12:10 justinthelaw