--production flag slows down my requests
Describe the bug
When I add the `--production` flag to the `bentoml serve` command, model serving becomes extremely slow compared to without the flag. The `--production` flag seems to make many predictions in parallel and thus slows down processing. The exact same data (and outputs) are obtained in both cases.
To Reproduce
Expected behavior
I would expect the time to process an input to be identical (or faster) with the `--production` flag.
Screenshots/Logs
Serve the model with:
`bentoml serve my_model:latest`
Ping the server:
`curl -X POST -H "content-type: application/json" --data "path_to_file" http://127.0.0.1:5000/predict`
Output:
Serve the model with:
`bentoml serve my_model:latest --production`
Ping the server:
`curl -X POST -H "content-type: application/json" --data "path_to_file" http://127.0.0.1:5000/predict`
Output:
Environment:
- OS: Ubuntu 18.05
- Python version: 3.9.7
- BentoML version: bentoml @ git+https://github.com/bentoml/BentoML.git@3b4bc6a7e061285908a8c4b248c57ad52224c2d2
Hi @jerpint, thanks for reporting this. Would you mind sharing the bento service definition snippet? For example:
```python
# bento.py
import bentoml
import bentoml.sklearn
import numpy as np
from bentoml.io import NumpyNdarray

# Load the runner for the latest ScikitLearn model we just saved
iris_clf_runner = bentoml.sklearn.load_runner("iris_clf:latest")

# Create the iris_classifier service with the ScikitLearn runner
# Multiple runners may be specified if needed in the runners array
# When packaged as a bento, the runners here will be included
svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])

# Create an API function with pre- and post-processing logic on your new "svc" annotation
@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def predict(input_ndarray: np.ndarray) -> np.ndarray:
    # Define pre-processing logic
    result = iris_clf_runner.run(input_ndarray)
    # Define post-processing logic
    return result
```
You can redact any sensitive information as needed.
It looks something like this (obfuscated a little):

```python
import json

import bentoml
import bentoml.pytorch
import numpy as np
import torch
import yaml
from yaml import load
from bentoml.io import Text, JSON

from my_project import make_predict_dataloader

runner = bentoml.pytorch.load_runner("my_model:latest")
svc = bentoml.Service("my_app", runners=[runner])

CONFIG_FILEPATH = "config.yaml"
with open(CONFIG_FILEPATH, "r") as stream:
    hyper_params = load(stream, Loader=yaml.FullLoader)


@svc.api(input=Text(), output=JSON())
def predict(filename: str) -> dict:
    # Generate the dataloader with the soundfile
    predict_dataloader = make_predict_dataloader(filename, hyper_params)

    # Iterate through each batch and gather results
    all_probs = []
    for data in predict_dataloader:
        pred = runner.run_batch(data)
        probs = torch.nn.functional.softmax(pred, dim=1)
        all_probs.extend(probs.cpu().numpy())

    # Collect and post-process results
    all_probs = np.array(all_probs)
    avg_prob = np.sum(all_probs, 0) / len(all_probs)
    result = {
        "prob1": round((avg_prob[1] * 100), 2),
        "prob2": round((avg_prob[0] * 100), 2),
    }
    return json.dumps(result)
```
Short of finding a solution, is there a way to run the container without the `--production` flag? It seems to be the default behaviour when running `docker run MY_IMAGE`, and I observe the same slow behaviour and logs.
So does this `make_predict_dataloader` function return a list of data? Have you tried just passing the dataloader in and letting `run_batch` handle inference that way?
`make_predict_dataloader` returns `predict_dataloader`, which is a typical PyTorch `torch.utils.data.DataLoader`.
The idea is that in my case, a "single prediction" actually needs to iterate over many samples, which is why I create a dataset + dataloader for my sample. Think of one prediction as, e.g., a batched prediction over a collection of images: I iterate through each batch and collect the result for each batch (and so for each image), the final result being an aggregation of all predictions.
If I pass the `dataloader` directly to `run_batch`, it fails, as the model doesn't expect to get a dataloader as an input.
Would it be preferable to wrap everything in a dataset instead (without a batch size) and pass that directly to `run_batch`?
> Short of finding a solution, is there a way to run the container without the `--production` flag? It seems to be the default behaviour when running `docker run MY_IMAGE`, and I observe the same slow behaviour and logs.
It is possible to start the container with your own command, including running `bentoml serve` without the flag: `docker run -p 5000:5000 YOUR_IMAGE_TAG bentoml serve $BENTO_PATH`, although this is not recommended for production usage.
The reason it is slower under `--production` is most likely lots of round trips between the API server process and the runner process. In this case, I'd recommend rewriting your code in one of the following styles, depending on how large the data returned from the data loader is:
- Call `run_batch` once with a batch of input data, as @aarnphm suggested, if the returned data is relatively small and your model can run inference on the entire data batch at once:
```python
@svc.api(input=Text(), output=JSON())
def predict(filename: str) -> dict:
    ...  # pre-processing
    preds = runner.run_batch([data for data in predict_dataloader])
    ...  # post-processing
```
- Use many async `run` calls, one for each data point. This can take advantage of multiple Runner processes, and the actual batch size fed into the runner's `run_batch` will be determined at runtime (i.e. it can handle a very large batch returned from `predict_dataloader`, as long as the latency allows):
```python
import asyncio

@svc.api(input=Text(), output=JSON())
async def predict(filename: str) -> dict:
    ...  # pre-processing
    preds = await asyncio.gather(
        *(runner.async_run(data) for data in predict_dataloader)
    )
    ...  # post-processing
```
Hope that helps. We are working on related docs to help users understand how Runners work, why they help make your model serving workload more efficient, and how to avoid some common pitfalls.
Thanks, I've tried implementing option 1 using:

```python
def predict(filename: str) -> dict:
    ...  # pre-processing
    preds = runner.run_batch(torch.cat([data for data in predict_dataloader]))
```
This works, and I obtain the same output results. However, I still notice a significant slowdown when passing the `--production` flag instead of the `--reload` flag. This time, though, the observed logs are the same.
Hi @jerpint, may I ask how you measured the speed decrease? Is it that the first request has a significantly longer wait? Were you able to measure an average over a period of time? That would be very helpful for us in understanding the issue here.
> This works, and I obtain the same output results. However, I still notice a significant slowdown when passing the `--production` flag instead of the `--reload` flag. This time, though, the observed logs are the same.
Note that the observed logs are not guaranteed to be the same when `--production` is applied: with `--production`, the runner will be scheduled in its own process and adaptive batching will be enabled.
@parano nothing more than just sending a `curl` request to the server and seeing how long it takes between hitting enter and getting the response. In both cases the server is local (on `localhost`), the code is exactly the same, and all that changes is the flag at the end of the command. In one case, using `--reload`, the response is almost instant. In the other, using `--production`, the response is very slow (on the order of seconds, and very noticeable). Other than changing the flag, everything else is exactly the same (same data, same response, same results, same outputs). The command is `bentoml serve ./bento.py:svc --production` vs `bentoml serve ./bento.py:svc --reload`.
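If a number is more useful than eyeballing, something like the following sketch could measure an average latency against the running server. It is hypothetical and not the exact measurement used here; it just reuses the endpoint, port, and placeholder payload from the `curl` command above:

```python
# Hypothetical latency check: send N sequential requests to the running server
# and report the average wall-clock latency. Uses the same endpoint, port, and
# placeholder payload as the curl command earlier in this issue.
import time

import requests

N = 20
latencies = []
for _ in range(N):
    start = time.perf_counter()
    resp = requests.post(
        "http://127.0.0.1:5000/predict",
        headers={"content-type": "application/json"},
        data="path_to_file",  # placeholder payload, as in the curl example
    )
    resp.raise_for_status()
    latencies.append(time.perf_counter() - start)

print(f"average latency: {sum(latencies) / N * 1000:.1f} ms over {N} requests")
```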
Hi guys!
I experienced the same behavior as @jerpint.
My tests were made with the https://k6.io/ tool.
When running the API call burst without the `--production` flag, with a duration of 10 seconds and one VU (virtual user), I get an average of 147 req/sec.
Running the same test with the `--production` flag, also for 10 seconds, I got an average of 70 req/sec.
Running the same two tests again, but now adding a `--vus 100` flag (100 virtual users) to k6, I got 261 req/sec and 79 req/sec respectively.
My model is a `picklable_model` wrapping a LightFM model. I can help with more tests if you need.
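For reference, a rough Python stand-in for that kind of burst test (a hypothetical sketch, not the actual k6 script; the URL, payload, and worker count are placeholders) would look something like:

```python
# Hypothetical equivalent of the k6 burst test described above: hit the
# endpoint with a fixed number of worker threads for a fixed duration and
# report requests per second.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://127.0.0.1:5000/predict"
DURATION_S = 10
WORKERS = 1  # roughly analogous to k6 virtual users


def worker(deadline: float) -> int:
    count = 0
    session = requests.Session()
    while time.perf_counter() < deadline:
        session.post(URL, headers={"content-type": "application/json"}, data="payload")
        count += 1
    return count


deadline = time.perf_counter() + DURATION_S
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    totals = list(pool.map(worker, [deadline] * WORKERS))

print(f"{sum(totals) / DURATION_S:.1f} req/sec with {WORKERS} workers")
```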
Can you guys try with the latest changes on the main branch?
Just did a fresh install from the latest `main` branch; still observing the same behaviour.
Hi, people! Experiencing similar behavior here.
- ONNX Runtime using GPU on a ResNet18-like model
- Using 64 concurrent requests

Running without `--production` gives around 200 req/sec.
Running with `--production` gives around 149 req/sec.
Running the same code without BentoML in a single thread using batches of 32 gives around 350 req/sec.
Observations:
- Non-`production` uses a single process; `production` uses multiprocessing. I tried different settings, from the default to specifying `--api-workers` of 0, 1, 2 up to 8, without improvement
- Non-`production`: CPU load ~70% on all cores, GPU load ~50%
- `production`: CPU load ~70% on 1 core and ~40% on the other cores, GPU load ~30%
- The GPU load in production is around 30%
- I tried different settings for `max_batch_size` in the `to_runner` with little improvement
- Increasing the number of concurrent requests to 128, 256 or 1024 does not improve the throughput rate
Theory:
- Somehow the multiprocessing approach used in `production` makes the total throughput lower?
- I also used an older version of BentoML (0.12.x) a while ago, and I remember that in that version I was able to get the GPU load close to 100% after some tuning of the number of batch workers. Because there is some pre- and post-processing, I want to use the multi-process approach of BentoML. Ideally the pre-/post-processing is done in parallel on the CPU while the model invocations happen as often as needed to keep the GPU close to 100% load. Currently neither the CPU nor the GPU is at 100%, even if I increase the concurrent requests to 128, 256 or even 1024.
I will keep searching and digging, but I'm also following this issue.
Thanks for the detailed report @udevnl!
Just wanted to pop in here and let you all know that this is now being worked on, and performance is going to be much more of a priority moving forward.
We've managed to reproduce it, and we think we've found a major cause of this slowdown; we're working on fixing it right now.
We've just merged #2267, which should speed things up quite a bit, especially for models with larger inputs. While I think there's still a little bit of room to improve speed everywhere, at the very least production throughput should now be faster than development.
Would you be able to try out the changes on the main branch? Or alternatively I'd be happy to ping you when we release this fix.
Thank you for looking into this so quickly! Just did some benchmarks: I installed the merge using `pip install git+https://github.com/bentoml/BentoML.git` and double-checked the updated source in site-packages. Given the same experimental setup:
- non-`production` throughput: ~210 req/sec
- `production` throughput: ~210 req/sec
It looks like the throughput is now similar in production, nice!
I did some additional tests which gave me some more observations:
- If I skip the call to the model inference runner, the throughput is ~540 req/sec (average image size 120 kB)
- If I skip the call to the model inference runner, the throughput is ~740 req/sec (average image size 3.6 kB)
- It seems the logging to the console also slows things down on my machine. If I redirect output to /dev/null, this increases the above from ~740 req/sec to ~870 req/sec.
So this gives an idea of the latency caused by everything except the model runner.
When trying to push the maximum throughput even higher I found the following:
- The highest I got is ~270 req/sec on `production`, using a reduced image size of 3.6 kB and default settings
- The GPU load gets close to 100% in non-`production` and to around 40% in `production`
- The CPU load stays below 40% on all cores in all cases (`production` and non-`production`)
Overall conclusions:
- The fix helped solve the slower production performance!
- I am still going to look for ways to increase the performance even more, since the GPU load still stays around ~40% and the CPU load is also low. I think there is some headroom somewhere in the form of latency, because I am only happy when either the CPU or the GPU gets maxed out ;-) If I find something I will post it.

Thank you for the fix!
Some performance graphs: best run non-`production`, and best run `production` (screenshots omitted).
Thank you so much for the detailed profiling, @udevnl. We are actively working to address this problem and would still love to hear more about your experiments 🎉
@udevnl Thanks for your help with benchmarking BentoML performance. If neither the CPU nor the GPU is saturated, the bottleneck could be in the API or Runner code itself. Here are common pitfalls to watch for:
- Not using an async API function definition. This could cause the API function to become the bottleneck.
- Having CPU-intensive compute in the runner. Since the runner is a singleton, a CPU-intensive runner would saturate a single CPU core while leaving the other cores relatively under-utilized. CPU-intensive preprocessing logic should be done in the API servers to leverage multi-core CPUs as much as possible.

If possible, we can get on a call to investigate more together. You can find us on Slack. If you could share a reproducer, that'd help, too.
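For illustration, here is a minimal sketch of the pattern described above, with CPU-bound preprocessing kept in an async API function and inference delegated to the runner. The model tag, service name, input type, and preprocessing step are all placeholders, not taken from this thread:

```python
# Hypothetical sketch only: an async API function that does CPU-bound
# preprocessing in the API server process (which scales across API workers)
# and then awaits the runner for inference.
import numpy as np

import bentoml
from bentoml.io import JSON, NumpyNdarray

runner = bentoml.pytorch.load_runner("my_model:latest")  # placeholder model tag
svc = bentoml.Service("example_service", runners=[runner])


@svc.api(input=NumpyNdarray(), output=JSON())
async def predict(arr: np.ndarray) -> dict:
    # CPU-intensive preprocessing stays in the API server, not in the runner
    arr = (arr.astype(np.float32) - arr.mean()) / (arr.std() + 1e-8)

    # Awaiting the runner keeps the event loop free to handle other requests
    pred = await runner.async_run(arr)
    return {"pred": np.asarray(pred).tolist()}
```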
I'm also affected, but the development version has not fixed the issue for me. I'm measuring the time spent on the computation in custom runners (directly in the runners) and the time of the actual calls to the runners in the API function. In the dev setting (`serve --reload`) the measurements are the same: the overhead between the API and the runners is negligible. In the `serve --production` setting there is a 10 ms gap between the time measured in the runners and in the API function. I'm sending a ~600x1000 px image to the runners.
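Roughly, the API-side half of that measurement looks like the following simplified sketch (not the actual code; the model tag and service name are placeholders, and the runner-side time is assumed to be logged from inside the custom runner):

```python
# Hypothetical, simplified sketch of the measurement: time the runner call from
# the API function and compare with the compute time logged inside the runner.
import time

import numpy as np

import bentoml
from bentoml.io import JSON, NumpyNdarray

runner = bentoml.pytorch.load_runner("my_model:latest")  # placeholder model tag
svc = bentoml.Service("timing_check", runners=[runner])


@svc.api(input=NumpyNdarray(), output=JSON())
def predict(image: np.ndarray) -> dict:
    start = time.perf_counter()
    pred = runner.run(image)  # the runner measures/logs its own compute time
    api_side_ms = (time.perf_counter() - start) * 1000
    # Under `serve --reload` the API-side time closely matches the runner-side
    # time; under `serve --production` the API-side time comes out ~10 ms higher.
    return {"api_side_ms": api_side_ms, "pred": np.asarray(pred).tolist()}
```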
> @udevnl Thanks for your help with benchmarking BentoML performance. If neither the CPU nor the GPU is saturated, the bottleneck could be in the API or Runner code itself. Here are common pitfalls to watch for:
> - Not using an async API function definition. This could cause the API function to become the bottleneck.
> - Having CPU-intensive compute in the runner. Since the runner is a singleton, a CPU-intensive runner would saturate a single CPU core while leaving the other cores relatively under-utilized. CPU-intensive preprocessing logic should be done in the API servers to leverage multi-core CPUs as much as possible.
Hey! I didn't get the second point about "Having CPU-intensive compute in the runner". The runner executes the inference, if I got it correctly, and that is CPU-bound. So a CPU-bound operation should definitely be performed in a runner, if I got it right... 🤔
@francobocciDH I read this and got confused too. Shouldn't it be in the runners?
Sorry about the super slow response; we haven't kept on top of our issues very well. I think that was referencing the case where the inference runs on GPU.
As this issue is now quite old, and I believe we've mostly solved the production interface issues, I'm going to close it now. If you're running into this, please feel free to open a new issue!