bug: Uneven request distribution across workers in BentoML service
Describe the bug
While running the BentoML service with 4 workers (each with 1 thread), it appears that the incoming HTTP requests are not evenly balanced across the worker processes. I'd like to know if there's any configuration I'm missing.
2025-08-06T06:00:53+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:43376 (scheme=http,method=POST,path=/classify,type=application/json,length=81) (status=200,type=application/json,length=10) 201.820ms (trace=2cf1120afa437d5ebe8f9792eb3519b0,span=2507eaf9014aba6e,sampled=0,service.name=Predictor)
2025-08-06T06:00:53+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:43608 (scheme=http,method=POST,path=/classify,type=application/json,length=81) (status=200,type=application/json,length=10) 202.028ms (trace=5565f4c5a624f3cda4c4835b2853f727,span=0e51701b5e614c96,sampled=0,service.name=Predictor)
2025-08-06T06:00:53+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:43816 (scheme=http,method=POST,path=/classify,type=application/json,length=81) (status=200,type=application/json,length=10) 201.912ms (trace=d475a2d347df94ca4edd3da858d16905,span=7b44259dafc4f72b,sampled=0,service.name=Predictor)
2025-08-06T06:00:53+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44090 (scheme=http,method=POST,path=/classify,type=application/json,length=79) (status=200,type=application/json,length=10) 201.699ms (trace=a539c617542d29cb80aa70dc8b2ee42d,span=3d80958ead45d70c,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44270 (scheme=http,method=POST,path=/classify,type=application/json,length=80) (status=200,type=application/json,length=10) 201.748ms (trace=fdb5dc2c2773f0d87a2ff20acdd45eb3,span=d83e432d948c33dc,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44538 (scheme=http,method=POST,path=/classify,type=application/json,length=80) (status=200,type=application/json,length=10) 201.562ms (trace=dcdaccdab55fec02b2aff73d8227d733,span=9f3ab4fe948da5df,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44716 (scheme=http,method=POST,path=/classify,type=application/json,length=79) (status=200,type=application/json,length=10) 201.492ms (trace=84fc319d34bfa9337a2eb08314015323,span=d0b1e9ead2bb78eb,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44962 (scheme=http,method=POST,path=/classify,type=application/json,length=78) (status=200,type=application/json,length=10) 201.729ms (trace=ae1647b92484b84a80ba7cee49d9667c,span=537587c62b086b50,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:45190 (scheme=http,method=POST,path=/classify,type=application/json,length=80) (status=200,type=application/json,length=10) 201.974ms (trace=ac90d9510dc60c5162309c900911d846,span=a1842bae213272ce,sampled=0,service.name=Predictor)
2025-08-06T06:00:55+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:45382 (scheme=http,method=POST,path=/classify,type=application/json,length=79) (status=200,type=application/
...
Here are the simple statistics. Full logs are attached here: bentoml_test_server.log
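For reference, here is a minimal sketch (not part of the original report) of how the per-worker distribution can be tallied from the attached log, assuming the [entry_service:Predictor:N] access-log format shown in the excerpt above:

```python
# Hypothetical helper (not part of the report): count requests per worker
# by parsing the "[entry_service:Predictor:N]" tag in the access log above.
import re
from collections import Counter

WORKER_TAG = re.compile(r"\[entry_service:Predictor:(\d+)\]")

counts = Counter()
with open("bentoml_test_server.log") as f:
    for line in f:
        match = WORKER_TAG.search(line)
        if match and "path=/classify" in line:
            counts[int(match.group(1))] += 1

for worker_id in sorted(counts):
    print(f"Worker {worker_id}: {counts[worker_id]} requests")
```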
Expected behavior
All the requests should be evenly distributed among the workers.
To reproduce
- Prepare a simple BentoML service class.
# service3.py
import bentoml
import logging
import time

bentoml_logger = logging.getLogger("bentoml")


@bentoml.service(workers=4, threads=1)
class Predictor:
    def __init__(self):
        pass

    @bentoml.api
    def classify(self, input_ids: list[list[int]]) -> list[float]:
        """
        input_ids example:
        [[82, 13, 59, 45, 97, 36, 74, 6, 91, 12, 33, 19, 77, 68, 40, 50]]
        """
        time.sleep(0.2)
        return [0.1, 0.2]
- Run the BentoML service
$ bentoml serve service3:Predictor
- Check that all the workers are running through htop.
- Prepare client code to generate HTTP requests.
# bento_request_en.py
import numpy as np
import requests
import time


def classify_input_ids():
    input_ids = np.random.randint(0, 100, (1, 16)).tolist()
    response = requests.post(
        "http://bentoml-test-server:3000/classify",
        json={"input_ids": input_ids},
        headers={
            "accept": "text/plain",
            "Content-Type": "application/json",
            "Connection": "close",
        },
    )
    print("Status Code:", response.status_code)
    print("Response:", response.text)


def run_for_duration(seconds: int):
    end_time = time.time() + seconds
    count = 0
    while time.time() < end_time:
        classify_input_ids()
        count += 1
    print(f"Sent {count} requests in total.")


if __name__ == "__main__":
    duration = int(input("Enter the duration to send requests (in seconds): "))
    run_for_duration(duration)
- Run the client code.
$ python3 bento_request_en.py
Enter the duration to send requests (in seconds): 180
...
- Check the logs of the BentoML service.
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44538 (scheme=http,method=POST,path=/classify,type=application/json,length=80) (status=200,type=application/json,length=10) 201.562ms (trace=dcdaccdab55fec02b2aff73d8227d733,span=9f3ab4fe948da5df,sampled=0,service.name=Predictor)
2025-08-06T06:00:54+0000 [INFO] [entry_service:Predictor:3] 172.16.140.42:44716 (scheme=http,method=POST,path=/classify,type=application/json,length=79) (status=200,type=application/json,length=10) 201.492ms
Environment
bentoml: 1.4.19
We do not have control over which worker the data will go through: https://circus.readthedocs.io/en/latest/tutorial/step-by-step/#installation:~:text=Note%20The%20load%20balancing%20is%20operated%20by%20the%20operating%20system%20so%20you%E2%80%99re%20getting%20the%20same%20speed%20as%20any%20other%20pre%2Dfork%20web%20server%20like%20Apache%20or%20NGinx.%20Circus%20does%20not%20interfer%20with%20the%20data%20that%20goes%20through.
How does it matter in your case?
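For context, here is a minimal illustration (not BentoML or Circus code) of the pre-fork pattern the linked note describes: several worker processes block in accept() on one shared listening socket, and the kernel alone decides which one is woken for each new connection.

```python
# Minimal pre-fork illustration (not BentoML/Circus internals): four children
# share one listening socket inherited from the parent; the kernel alone
# decides which blocked accept() call is woken for each incoming connection.
import os
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("127.0.0.1", 9000))
sock.listen(128)

for _ in range(4):
    if os.fork() == 0:  # child process
        while True:
            conn, _addr = sock.accept()  # the OS picks which child is woken
            conn.sendall(f"handled by worker pid {os.getpid()}\n".encode())
            conn.close()

for _ in range(4):  # parent: wait for the children (they run until killed)
    os.wait()
```

Neither the application nor Circus chooses the worker here; how evenly connections are spread is entirely up to the kernel's wake-up behaviour, which is why light, sequential load can end up mostly on a single worker.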
@chlee1016 May I ask what your operating system is? If it's Linux, what is the kernel version?
@bojiang Sure, here is the additional information.
- OS: Debian GNU/Linux 12
- Kernel version: 5.10.109-1.20220408.el7.x86_64
@chlee1016 Thanks. Another thing is that currently the requests are sent one by one. Would you like to test it again with some real concurrent requests?
We are still following up, @chlee1016.
@bojiang, I have conducted the test with concurrent requests, but got a similar result.
- Prepare 4 pods running on different nodes.
- Run python3 bento_request_en.py in each pod.
- Check the result.
Similar to the previous test, it seems that almost all of the requests are passed to worker #3. The log file is attached here.
The balancing result is not correct since you are sending requests in sequence
@frostming
The balancing result is not correct since you are sending requests in sequence
I ran the experiment because I expected the requests to be distributed across multiple workers, even though it is not controlled by the application. Do you mean that I should make the service API asynchronous and have the client send requests asynchronously as well?
Do you mean that I should make the service API asynchronous and have the client send requests asynchronously as well?
Just make the client send requests concurrently
@frostming
Client code:
import numpy as np
import requests
import time
import threading

URL = "http://localhost:8080/classify"


def classify_input_ids():
    input_ids = np.random.randint(0, 100, (1, 16)).tolist()
    resp = requests.post(
        URL,
        json={"input_ids": input_ids},
        headers={
            "accept": "text/plain",
            "Content-Type": "application/json",
            "Connection": "close",
        },
        timeout=10,
    )
    return resp.status_code, resp.text


def worker_thread(name: str, end_time: float, counter: list, lock: threading.Lock):
    local_count = 0
    while time.time() < end_time:
        try:
            status, _ = classify_input_ids()
            local_count += 1
        except Exception:
            pass
    with lock:
        counter[0] += local_count


def run_concurrent(duration: int, num_workers: int):
    end_time = time.time() + duration
    counter = [0]
    lock = threading.Lock()
    threads = []
    for i in range(num_workers):
        t = threading.Thread(
            target=worker_thread,
            args=(f"worker-{i+1}", end_time, counter, lock),
            daemon=True,
        )
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    print(f"Total {counter[0]} requests sent. (Workers: {num_workers}, Duration: {duration}s)")


if __name__ == "__main__":
    try:
        duration = int(input("Enter duration in seconds: "))
        workers = int(input("Enter number of workers: "))
    except ValueError:
        print("Please enter an integer.")
        raise
    run_concurrent(duration, workers)
Results (10 secs, 4 client threads):
- Worker 1: 8 requests
- Worker 2: 52 requests
- Worker 3: 2 requests
- Worker 4: 171 requests
Summary: The result is the same as in the previous test; worker 4 handled 171 of the 233 requests (about 73%).
While testing, I found that there was a similar discussion in uvicorn.
From what I understand, in BentoML v1.4, the workers accept connections from a shared listening socket, and the operating system is responsible for load balancing incoming connection requests.
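To make the per-worker split visible from the client side as well, one option is a small diagnostic variation of the service above (hypothetical, for observation only, not a fix) that returns the handling worker's PID in each response:

```python
# Hypothetical diagnostic variant of service3.py (not a fix): each response
# reports the PID of the worker process that handled the request, so the
# client can observe how the OS spreads connections across the workers.
import os
import time

import bentoml


@bentoml.service(workers=4, threads=1)
class Predictor:
    @bentoml.api
    def classify(self, input_ids: list[list[int]]) -> dict:
        time.sleep(0.2)
        return {"worker_pid": os.getpid(), "scores": [0.1, 0.2]}
```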
Q. My goal is to serve on Kubernetes by running multiple workers within a single container using BentoML. Would this approach be considered an anti-pattern?
I’d also like to know if there are any alternative solutions to address this. 🙏