BentoML
bug: Batching async run issue
Describe the bug
I am using a custom runner model.
I am testing batching by enabling it on the runner and serving with bentoml serve and the --production flag. I have configured a timeout of 60 seconds, and a batching configuration of 100 ms timeout and 1000 as max batch size.
Whenever I send requests with batching enabled, I get the error below:
2022-11-08T13:26:15+0000 [INFO] [api_server:1] 127.0.0.1:35046 (scheme=http,method=POST,path=/v1/get_intents,type=application/json,length=91) (status=200,type=application/json,length=20) 5802.654ms (trace=456b2c38d391f41db16db8d862fcc898,span=ed66b68b017a42fe,sampled=0)
Traceback (most recent call last):
  File "/workspace/personality_framework/personality_service/bento_service.py", line 238, in get_intent
    result=await runner1.is_positive.async_run([{"sentence":query}])
  File "/tmp/e2/lib/python3.8/site-packages/bentoml/_internal/runner/runner.py", line 53, in async_run
    return await self.runner._runner_handle.async_run_method( # type: ignore
  File "/tmp/e2/lib/python3.8/site-packages/bentoml/_internal/runner/runner_handle/remote.py", line 207, in async_run_method
    raise ServiceUnavailable(body.decode()) from None
bentoml.exceptions.ServiceUnavailable: Service Busy
However, the same code works without any issues when the runner is called synchronously (without async_run).
Below is the test file:
import os
import json
import bentoml
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from bentoml.io import Text, JSON
model = KNeighborsClassifier()
iris = load_iris()
X = iris.data[:, :4]
Y = iris.target
model.fit(X, Y)
class QueryAnalysisRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("nvidia.com/gpu", "cpu")
    SUPPORTS_CPU_MULTI_THREADING = False
    def __init__(self, context):
        self.model = model
    @bentoml.Runnable.method(batchable=True)
    def is_positive(self, input_text):
        print("batch size is ", len(input_text))
        scores = self.model.predict(input_text)
        return scores
ctx = {}
ctx["system_properties"] = {}
ctx["system_properties"]["model_dir"] = "model/bentoml/"
runner1 = bentoml.Runner(
    QueryAnalysisRunnable,
    name="testhandler",
    runnable_init_params={
        "context": ctx,
    },
    max_batch_size=1000,
    max_latency_ms=250,
)
svc = bentoml.Service("test_service", runners=[runner1])
@svc.api(input=JSON(),
         output=JSON(),
         route="v1/test"
         )
def test(input: JSON) -> JSON:
    output_json = {}
    data = input
    try:
        #data = request.get_json()
        print("\nInput JSON: {}".format(data))
        if all(key in data for key in ['query', 'timezone', 'skill_id']):
            query = data['query']
            timezone = data['timezone']
            skill_id = data['skill_id']
            a=[[0.1,2,3,4]]
            # Synchronous call to the batchable runner method.
            result= runner1.is_positive.run(a)
            #print(result)
            #output_json={}
            output_json["result"]=result
            output_json['status'] = 'success'
        else:
            output_json['status'] = 'failure'
    except Exception as e:
        import traceback
        traceback.print_exc()
        output_json['status'] = 'failure'
        # Log and TAG are not defined in this file, so print instead.
        print("An exception occurred: {}".format(e))
    return output_json
@svc.api(input=JSON(),
         output=JSON(),
         route="v1/test1"
         )
async def test1(input: JSON) -> JSON:
    output_json = {}
    data = input
    try:
        #data = request.get_json()
        print("\nInput JSON: {}".format(data))
        if all(key in data for key in ['query', 'timezone', 'skill_id']):
            query = data['query']
            timezone = data['timezone']
            skill_id = data['skill_id']
            a=[[0.1,2,3,4]]
            # Asynchronous call to the same runner method; this is the call that fails.
            result= await runner1.is_positive.async_run(a)
            #print(result)
            #output_json={}
            output_json["result"]=result
            output_json['status'] = 'success'
        else:
            output_json['status'] = 'failure'
    except Exception as e:
        import traceback
        traceback.print_exc()
        output_json['status'] = 'failure'
        # Log and TAG are not defined in this file, so print instead.
        print("An exception occurred: {}".format(e))
    return output_json
To reproduce
To reproduce the issue, save the above code in a file named test.py and install the dependencies:
pip install bentoml scikit-learn==1.0.2 numpy scipy
Run the code:
bentoml --production test.py --api-workers=1
Then run the test commands below.
- Synchronous mode:
url="http://0.0.0.0:3000/v1/test"
for x in {1..500}; do curl --request POST $url --header 'Content-Type: application/json' --data-raw '{"query":"are you a real robot","timezone":"Asia/Kolkata", "lang_code":"en","skill_id":"0"}' & done
- Async mode, using the async_run method to call the runner:
url="http://0.0.0.0:3000/v1/test1"
for x in {1..500}; do curl --request POST $url --header 'Content-Type: application/json' --data-raw '{"query":"are you a real robot","timezone":"Asia/Kolkata", "lang_code":"en","skill_id":"0"}' & done
With the async_run method, many requests fail with the error shown above.
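For anyone who wants per-request results rather than raw curl output, here is a rough Python equivalent of the curl loops above, using only the standard library. It is a sketch, not part of the original report: the helper names classify_response, post_once and fire are made up for illustration. It tallies the "status" field of each JSON response instead of the HTTP status code, because the access log above shows status=200 even for a failed request (the handler catches the exception and returns a failure body).

```python
import json
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Same JSON body as the curl commands in the report.
PAYLOAD = {
    "query": "are you a real robot",
    "timezone": "Asia/Kolkata",
    "lang_code": "en",
    "skill_id": "0",
}


def classify_response(body):
    """Return the "status" field of a JSON response body, or "invalid_body"."""
    try:
        parsed = json.loads(body)
    except ValueError:
        return "invalid_body"
    if isinstance(parsed, dict):
        return str(parsed.get("status", "invalid_body"))
    return "invalid_body"


def post_once(url, timeout=60):
    """POST the payload once and return a short outcome label."""
    request = urllib.request.Request(
        url,
        data=json.dumps(PAYLOAD).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_response(response.read())
    except urllib.error.HTTPError as err:
        # Non-2xx responses, e.g. "http_500" or "http_503".
        return "http_{}".format(err.code)
    except OSError:
        # Connection refused, timeout, DNS failure, and similar.
        return "connection_error"


def fire(url, n=500, send=post_once):
    """Send n concurrent requests to url and count the outcome labels."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return Counter(pool.map(send, [url] * n))
```

With the server running, calling fire("http://0.0.0.0:3000/v1/test") and fire("http://0.0.0.0:3000/v1/test1") gives one tally per endpoint, which makes the difference between the sync and async routes easy to compare.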
Expected behavior
In async mode, the service should handle all requests successfully and support higher throughput than synchronous mode.
Traceback (most recent call last):
  File "/workspace/personality_framework/personality_service/test.py", line 108, in test1
    result= await runner1.is_positive.async_run(a)
  File "/workspace/personality_framework/personality_service/r2/lib/python3.8/site-packages/bentoml/_internal/runner/runner.py", line 53, in async_run
    return await self.runner._runner_handle.async_run_method( # type: ignore
  File "/workspace/personality_framework/personality_service/r2/lib/python3.8/site-packages/bentoml/_internal/runner/runner_handle/remote.py", line 207, in async_run_method
    raise ServiceUnavailable(body.decode()) from None
bentoml.exceptions.ServiceUnavailable: Service Busy
Without async_run, the script executes successfully with no errors.
Environment
configuration.yml
runners:
  timeout: 60
  logging:
    access:
      enabled: false
      request_content_length: true
      request_content_type: true
      response_content_length: true
      response_content_type: true
  metrics:
    enabled: false
    namespace: bentoml_runner
  batching:
    enabled: true
    max_batch_size: 100
    max_latency_ms: 1000
api_server:
  timeout: 60
same issue here, any solution?
Same issue. It was quite smooth for me on 0.13-lts, but I started having this issue with 1.0.
I got the same issue. Any advice?
Facing the same issue. Is there a way to fix it?
@przdev I tried the steps in the original post and couldn't manage to reproduce it. Since it is several months old, can you provide the steps to reproduce it?
I am having the same issue, has anyone found a solution?
This looks like max_latency_ms is just set too low. Have you tried increasing that value?
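To make the suggestion above more concrete, here is a toy model of why a small latency budget could turn a burst of requests into rejections. This is only an illustration under stated assumptions: the function simulate_burst, the rejection rule, and the 100 ms per-batch time are all hypothetical, and BentoML's real dispatcher is adaptive and differs in detail.

```python
def simulate_burst(n_requests, max_batch_size, max_latency_ms, batch_time_ms):
    """Return (served, rejected) for a burst of requests that all arrive at t=0.

    Toy rule (an assumption, not BentoML's actual algorithm): once the queue
    has been waiting longer than max_latency_ms, everything still queued is
    rejected instead of being batched.
    """
    served = 0
    rejected = 0
    now_ms = 0.0
    remaining = n_requests
    while remaining > 0:
        if now_ms > max_latency_ms:
            # Everything still queued has exceeded the latency budget.
            rejected += remaining
            break
        batch = min(max_batch_size, remaining)
        served += batch
        remaining -= batch
        # The next batch can only start after this one finishes.
        now_ms += batch_time_ms
    return served, rejected


# 500 requests in one burst, batches of up to 100, 100 ms per batch (hypothetical).
print(simulate_burst(500, 100, 250, 100))   # (300, 200)
print(simulate_burst(500, 100, 1000, 100))  # (500, 0)
```

In this simplified picture, a 250 ms budget rejects the tail of a 500-request burst while a 1000 ms budget serves all of it, which is the kind of effect raising max_latency_ms would be expected to have if the budget is the cause.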