bug: Uneven request distribution across workers in BentoML service
Describe the bug
While running the BentoML service with 4 workers (each with 1 thread), it appears that the incoming HTTP requests are not evenly balanced across the worker processes. I'd like to know if there's any configuration I'm missing.
2025-08-06T06:00:53+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:43376 (scheme=http,method=POST,path=/classify,type=application/json,length=81) (status=200,type=application/json,length=10) 201.820ms (trace=2cf1120afa437d5ebe8f9792eb3519b0,span=2507eaf9014aba6e,sampled=0,service.name=Predictor)
2025-08-06T06:00:53+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:43608 (scheme=http,method=POST,path=/classify,type=application/json,length=81) (status=200,type=application/json,length=10) 202.028ms (trace=5565f4c5a624f3cda4c4835b2853f727,span=0e51701b5e614c96,sampled=0,service.name=Predictor)
2025-08-06T06:00:53+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:43816 (scheme=http,method=POST,path=/classify,type=application/json,length=81) (status=200,type=application/json,length=10) 201.912ms (trace=d475a2d347df94ca4edd3da858d16905,span=7b44259dafc4f72b,sampled=0,service.name=Predictor)
2025-08-06T06:00:53+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44090 (scheme=http,method=POST,path=/classify,type=application/json,length=79) (status=200,type=application/json,length=10) 201.699ms (trace=a539c617542d29cb80aa70dc8b2ee42d,span=3d80958ead45d70c,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44270 (scheme=http,method=POST,path=/classify,type=application/json,length=80) (status=200,type=application/json,length=10) 201.748ms (trace=fdb5dc2c2773f0d87a2ff20acdd45eb3,span=d83e432d948c33dc,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44538 (scheme=http,method=POST,path=/classify,type=application/json,length=80) (status=200,type=application/json,length=10) 201.562ms (trace=dcdaccdab55fec02b2aff73d8227d733,span=9f3ab4fe948da5df,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44716 (scheme=http,method=POST,path=/classify,type=application/json,length=79) (status=200,type=application/json,length=10) 201.492ms (trace=84fc319d34bfa9337a2eb08314015323,span=d0b1e9ead2bb78eb,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44962 (scheme=http,method=POST,path=/classify,type=application/json,length=78) (status=200,type=application/json,length=10) 201.729ms (trace=ae1647b92484b84a80ba7cee49d9667c,span=537587c62b086b50,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:45190 (scheme=http,method=POST,path=/classify,type=application/json,length=80) (status=200,type=application/json,length=10) 201.974ms (trace=ac90d9510dc60c5162309c900911d846,span=a1842bae213272ce,sampled=0,service.name=Predictor)
2025-08-06T06:00:55+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:45382 (scheme=http,method=POST,path=/classify,type=application/json,length=79) (status=200,type=application/
...
Here are the simple statistics. Full logs are attached here: bentoml_test_server.log
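For reference, here is a minimal sketch (not part of the original report) of how the per-worker distribution can be tallied from the attached log, assuming the [entry_service:Predictor:N] access-log format shown in the excerpt above:

```python
# Hypothetical helper (not part of the report): count requests per worker
# by parsing the "[entry_service:Predictor:N]" tag in the access log above.
import re
from collections import Counter

WORKER_TAG = re.compile(r"\[entry_service:Predictor:(\d+)\]")

counts = Counter()
with open("bentoml_test_server.log") as f:
    for line in f:
        match = WORKER_TAG.search(line)
        if match and "path=/classify" in line:
            counts[int(match.group(1))] += 1

for worker_id in sorted(counts):
    print(f"Worker {worker_id}: {counts[worker_id]} requests")
```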
Expected behavior
All the requests should be evenly distributed among the workers.
To reproduce
- Prepare a simple BentoML service class.
# service3.py
import bentoml
import logging
import time

bentoml_logger = logging.getLogger("bentoml")


@bentoml.service(workers=4, threads=1)
class Predictor:
    def __init__(self):
        pass

    @bentoml.api
    def classify(self, input_ids: list[list[int]]) -> list[float]:
        """
        input_ids example:
        [[82, 13, 59, 45, 97, 36, 74, 6, 91, 12, 33, 19, 77, 68, 40, 50]]
        """
        time.sleep(0.2)
        return [0.1, 0.2]
- Run the BentoML service
$ bentoml serve service3:Predictor
- Check that all the workers are running through htop.
- Prepare client code to generate HTTP requests.
# bento_request_en.py
import numpy as np
import requests
import time


def classify_input_ids():
    input_ids = np.random.randint(0, 100, (1, 16)).tolist()
    response = requests.post(
        "http://bentoml-test-server:3000/classify",
        json={"input_ids": input_ids},
        headers={
            "accept": "text/plain",
            "Content-Type": "application/json",
            "Connection": "close",
        },
    )
    print("Status Code:", response.status_code)
    print("Response:", response.text)


def run_for_duration(seconds: int):
    end_time = time.time() + seconds
    count = 0
    while time.time() < end_time:
        classify_input_ids()
        count += 1
    print(f"Sent {count} requests in total.")


if __name__ == "__main__":
    duration = int(input("Enter the duration to send requests (in seconds): "))
    run_for_duration(duration)
- Run the client code.
$ python3 bento_request_en.py
Enter the duration to send requests (in seconds): 180
...
- Check the logs of the BentoML service.
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44538 (scheme=http,method=POST,path=/classify,type=application/json,length=80) (status=200,type=application/json,length=10) 201.562ms (trace=dcdaccdab55fec02b2aff73d8227d733,span=9f3ab4fe948da5df,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44716 (scheme=http,method=POST,path=/classify,type=application/json,length=79) (status=200,type=application/json,length=10) 201.492ms
Environment
bentoml: 1.4.19
We do not have control over which worker the data will go through: https://circus.readthedocs.io/en/latest/tutorial/step-by-step/#installation:~:text=Note%20The%20load%20balancing%20is%20operated%20by%20the%20operating%20system%20so%20you%E2%80%99re%20getting%20the%20same%20speed%20as%20any%20other%20pre%2Dfork%20web%20server%20like%20Apache%20or%20NGinx.%20Circus%20does%20not%20interfer%20with%20the%20data%20that%20goes%20through.
How does it matter in your case?
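For context, here is a minimal illustration (not BentoML or Circus code) of the pre-fork pattern the linked note describes: several worker processes block in accept() on one shared listening socket, and the kernel alone decides which one is woken for each new connection.

```python
# Minimal pre-fork illustration (not BentoML/Circus internals): four children
# share one listening socket inherited from the parent; the kernel alone
# decides which blocked accept() call is woken for each incoming connection.
import os
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("127.0.0.1", 9000))
sock.listen(128)

for _ in range(4):
    if os.fork() == 0:  # child process
        while True:
            conn, _addr = sock.accept()  # the OS picks which child is woken
            conn.sendall(f"handled by worker pid {os.getpid()}\n".encode())
            conn.close()

for _ in range(4):  # parent: wait for the children (they run until killed)
    os.wait()
```

Neither the application nor Circus chooses the worker here; how evenly connections are spread is entirely up to the kernel's wake-up behaviour, which is why light, sequential load can end up mostly on a single worker.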
@chlee1016 May I ask what your operating system is? If it's Linux, what is the kernel version?
@bojiang Sure, here is the additional information.
- OS: Debian GNU/Linux 12
- Kernel version: 5.10.109-1.20220408.el7.x86_64
@chlee1016 Thanks. Another thing is that currently the requests are sent one by one. Would you like to test it again with some real concurrent requests?
We are still following up, @chlee1016.
@bojiang, I have conducted the test with concurrent requests, but got a similar result.
- Prepare 4 pods running on different nodes.
- Run python3 bento_request_en.py in each pod.
- Check the result.
Similar to the previous test, it seems that almost all of the requests are passed to worker #3. The log file is attached here.
The balancing result is not correct since you are sending requests in sequence
@frostming
The balancing result is not correct since you are sending requests in sequence
I ran the experiment because I expected the requests to be distributed across multiple workers, even though it is not controlled by the application. Do you mean that I should make the service API asynchronous and have the client send requests asynchronously as well?
Do you mean that I should make the service API asynchronous and have the client send requests asynchronously as well?
Just make the client send requests concurrently
@frostming
Client code:
import numpy as np
import requests
import time
import threading

URL = "http://localhost:8080/classify"


def classify_input_ids():
    input_ids = np.random.randint(0, 100, (1, 16)).tolist()
    resp = requests.post(
        URL,
        json={"input_ids": input_ids},
        headers={
            "accept": "text/plain",
            "Content-Type": "application/json",
            "Connection": "close",
        },
        timeout=10,
    )
    return resp.status_code, resp.text


def worker_thread(name: str, end_time: float, counter: list, lock: threading.Lock):
    local_count = 0
    while time.time() < end_time:
        try:
            status, _ = classify_input_ids()
            local_count += 1
        except Exception:
            pass
    with lock:
        counter[0] += local_count


def run_concurrent(duration: int, num_workers: int):
    end_time = time.time() + duration
    counter = [0]
    lock = threading.Lock()
    threads = []
    for i in range(num_workers):
        t = threading.Thread(
            target=worker_thread,
            args=(f"worker-{i+1}", end_time, counter, lock),
            daemon=True,
        )
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    print(f"Total {counter[0]} requests sent. (Workers: {num_workers}, Duration: {duration}s)")


if __name__ == "__main__":
    try:
        duration = int(input("Enter duration in seconds: "))
        workers = int(input("Enter number of workers: "))
    except ValueError:
        print("Please enter an integer.")
        raise
    run_concurrent(duration, workers)
Results (10 secs, 4 client threads):
- Worker 1: 8 requests
- Worker 2: 52 requests
- Worker 3: 2 requests
- Worker 4: 171 requests
Summary: The result is the same as in the previous test; worker 4 handled 171 of the 233 requests (about 73%).
While testing, I found that there was a similar discussion in uvicorn.
From what I understand, in BentoML v1.4, the workers accept connections from a shared listening socket, and the operating system is responsible for load balancing incoming connection requests.
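To make the per-worker split visible from the client side as well, one option is a small diagnostic variation of the service above (hypothetical, for observation only, not a fix) that returns the handling worker's PID in each response:

```python
# Hypothetical diagnostic variant of service3.py (not a fix): each response
# reports the PID of the worker process that handled the request, so the
# client can observe how the OS spreads connections across the workers.
import os
import time

import bentoml


@bentoml.service(workers=4, threads=1)
class Predictor:
    @bentoml.api
    def classify(self, input_ids: list[list[int]]) -> dict:
        time.sleep(0.2)
        return {"worker_pid": os.getpid(), "scores": [0.1, 0.2]}
```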
Q. My goal is to serve on Kubernetes by running multiple workers within a single container using BentoML. Would this approach be considered an anti-pattern?
I’d also like to know if there are any alternative solutions to address this. 🙏