Slow compared to seldon-core-microservice with multiple workers
I have ported an existing seldon-core model service to MLServer, but found that performance dropped a lot.
Below is my load test config and the results for the REST endpoint:

- seldon-core: 2 replicas, 2 CPU, 2G RAM, 4 gunicorn workers with 2 threads each, about 100 requests/s
- mlserver: 2 replicas, 8 CPU, 4G RAM, 4 parallel workers, about 8 requests/s
Not sure if it is due to https://github.com/SeldonIO/MLServer/issues/353?
version: mlserver==0.5.3
Hey @cksac ,
Thanks for sharing those results. From our own internal benchmarking, MLServer is generally faster than the previous server used in Seldon Core, so it would be great to understand more about your environment to learn why it is slower in this case.
Is this a custom model? Ideally, it would be great if you could share the code. I'm conscious that this can be delicate sometimes though, so if that's not possible, could you share as many details as you can? E.g. does your custom code leverage asyncio? How long did the test run for? How large are the request payloads generally? Etc.
Alternatively, if this is using one of the pre-packaged runtimes, could you share which one it was using? Also, any extra details around the model that can help us replicate the low throughput would be appreciated.
PS: #353 should only become an issue if your custom code uses asyncio. However, even if this was the case, this would just "drop" MLServer's performance to the same level as the previous inference server in Seldon Core.
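For reference, this is roughly what "leveraging asyncio" in a custom runtime means; this is a hand-written sketch, not code from your service, and the class name and payload handling are just placeholders:

```python
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput


class CustomModel(MLModel):
    async def load(self) -> bool:
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Anything awaited here (e.g. an async HTTP or DB call) yields control
        # back to the event loop while it waits; blocking calls (CPU-bound
        # work, time.sleep) hold on to the loop instead.
        result = await self._infer(payload)
        return InferenceResponse(
            model_name=self.name,
            outputs=[
                ResponseOutput(
                    name="output-0", shape=[1], datatype="BYTES", data=[result]
                )
            ],
        )

    async def _infer(self, payload: InferenceRequest) -> str:
        # Placeholder for the actual (async) inference logic.
        return "ok"
```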
You can find a demo repo here: https://github.com/cksac/mlserver-bench. Follow the README.md to run the test. It will start nginx as a load balancer and 2 MLServer instances.
I use time.sleep to simulate the model prediction call (CPU bound): https://github.com/cksac/mlserver-bench/blob/main/bench_server/models/model.py#L34
With time.sleep(0.1) in the predict method I get about 35 requests/s; after commenting out the time.sleep(0.1), about 130 requests/s.
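Roughly, the predict method is just the following (the real code is in the repo above; payload handling is elided in this sketch):

```python
import time

from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse


class BenchModel(MLModel):
    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Simulate a CPU-bound prediction; this blocks the worker's
        # event loop for the whole 100ms.
        time.sleep(0.1)
        return InferenceResponse(model_name=self.name, outputs=[])
```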
Hey @cksac,
Thanks a lot for putting in the time to get that benchmarking repo together.
I've been spending some time looking at it, and was wondering: with 2 parallel workers per instance and (at least) 100ms of blocking time per request, doesn't 35 reqs/sec sound about right? That is, if we take into account only the blocking time, each "parallel worker thread" would only be able to execute 10 of them in a second, so 35 reqs/sec doesn't sound too far off. Is there anything I'm missing in that reasoning?
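For concreteness, a back-of-the-envelope version of that reasoning (assuming 2 instances behind nginx, as in the repo):

```python
instances = 2            # MLServer instances behind nginx
workers = 2              # parallel workers per instance
blocking_time = 0.1      # seconds of time.sleep per request

# Each worker can execute at most 1 / 0.1 = 10 blocking requests per second,
# so the theoretical ceiling is:
max_throughput = instances * workers / blocking_time
print(max_throughput)    # 40.0 reqs/sec, vs the ~35 reqs/sec observed
```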
On top of that, from what I've seen, time.sleep() also tends to overshoot the waiting time on sub-second timeouts.
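You can measure the overshoot on your machine with something like:

```python
import time

start = time.perf_counter()
time.sleep(0.1)
elapsed = time.perf_counter() - start
# On most machines this prints slightly more than 100ms.
print(f"time.sleep(0.1) actually slept for {elapsed * 1000:.1f}ms")
```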
Just for reference, we do run benchmarks from time to time (on both custom and simple SKLearn models). You can check them out in the ./benchmarking folder. On those, average request times tend to be in the low 50ms range.
On your original benchmark, where you reported the performance difference against the current server in Seldon Core, were you using a similar test setup?
Hey @cksac ,
We'll be closing this one due to inactivity. However, please do reopen it if this is still an issue for you.