MLServer
Efficiency with multiprocessing: parallelism does not work when setting 'parallel_workers > 1'
Hi, I'm trying to improve the throughput of my server, which runs on MLServer, and I learned that I can set 'parallel_workers > 1' to enable parallel inference. Hence I set it in settings.json as below:
{
  "parallel_workers": 10,
  "debug": "true"
}
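For context, the same knob also exists on MLServer's Settings object when embedding the server programmatically. A minimal sketch, assuming a recent MLServer where mlserver.settings.Settings exposes parallel_workers:

from mlserver.settings import Settings

# Mirrors the settings.json above: 10 inference workers, debug enabled.
settings = Settings(parallel_workers=10, debug=True)
print(settings.parallel_workers, settings.debug)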
Then I used ab (Apache Benchmark) to test my server, while using top to monitor CPU and memory usage.
I can see that the server really does fork 10 processes. However, only 1 process actually did any work while the others sat idle.
The ab results show that 'parallel_workers=1' has the same latency as 'parallel_workers=10':
'parallel_workers=10':
ab -n 10000 -c 10 -T application/json -p sklearn-mlserver.json http://localhost:8080/v2/models/sklearn/infer
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests
Server Software: uvicorn
Server Hostname: localhost
Server Port: 8080
Document Path: /v2/models/sklearn/infer
Document Length: 184 bytes
Concurrency Level: 10
Time taken for tests: 25.962 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 6220000 bytes
Total body sent: 3340000
HTML transferred: 1840000 bytes
Requests per second: 385.18 [#/sec] (mean)
Time per request: 25.962 [ms] (mean)
Time per request: 2.596 [ms] (mean, across all concurrent requests)
Transfer rate: 233.97 [Kbytes/sec] received
125.63 kb/s sent
359.60 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.2 0 3
Processing: 4 26 4.5 24 69
Waiting: 4 24 4.3 23 67
Total: 5 26 4.5 25 69
Percentage of the requests served within a certain time (ms)
50% 25
66% 26
75% 27
80% 28
90% 30
95% 34
98% 42
99% 46
100% 69 (longest request)
'parallel_workers=1':
ab -n 10000 -c 10 -T application/json -p sklearn-mlserver.json http://localhost:8080/v2/models/sklearn/infer
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests
Server Software: uvicorn
Server Hostname: localhost
Server Port: 8080
Document Path: /v2/models/sklearn/infer
Document Length: 184 bytes
Concurrency Level: 10
Time taken for tests: 28.567 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 6220000 bytes
Total body sent: 3340000
HTML transferred: 1840000 bytes
Requests per second: 350.06 [#/sec] (mean)
Time per request: 28.567 [ms] (mean)
Time per request: 2.857 [ms] (mean, across all concurrent requests)
Transfer rate: 212.63 [Kbytes/sec] received
114.18 kb/s sent
326.81 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.3 0 5
Processing: 9 28 10.1 25 110
Waiting: 8 27 9.6 24 107
Total: 9 28 10.2 25 111
Percentage of the requests served within a certain time (ms)
50% 25
66% 27
75% 28
80% 29
90% 35
95% 49
98% 66
99% 76
100% 111 (longest request)
Hey @ooooona ,
Depending on your benchmark settings, there may not be enough traffic to keep the other workers busy in parallel. Generally, MLServer will round-robin requests across workers; this is the case for MLServer > 1.2.0 (which version of MLServer are you using?).
However, if there aren't enough concurrent requests (e.g. when a single client sends requests one at a time, or requests are processed very quickly), each worker will finish its request before the next one comes in, which effectively looks like only one of them is working.
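To rule that out, here is a minimal Python sketch (not from the thread) that keeps a fixed number of requests in flight against the endpoint and payload used above; the httpx dependency and the total/concurrency values are assumptions:

import asyncio
import json

import httpx

URL = "http://localhost:8080/v2/models/sklearn/infer"

async def main(total: int = 10_000, concurrency: int = 10) -> None:
    with open("sklearn-mlserver.json") as f:
        payload = json.load(f)

    # The semaphore keeps `concurrency` requests in flight at all times,
    # so several MLServer workers can be busy simultaneously.
    sem = asyncio.Semaphore(concurrency)

    async with httpx.AsyncClient() as client:
        async def one_request() -> None:
            async with sem:
                resp = await client.post(URL, json=payload)
                resp.raise_for_status()

        await asyncio.gather(*(one_request() for _ in range(total)))

if __name__ == "__main__":
    asyncio.run(main())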
hi @adriangonz ,
- I checked my MLServer version: mlserver, version 1.3.0.dev3. Actually, I built the image from git at commit 'eaa056371befccf74c66efc62192ffdd3c4a254e'.
- As you can see from my testing command, the total request count is 10,000 and, in the first case, the concurrency is 10, so I think the traffic is quite heavy. Today I even tried 100,000 total requests with a concurrency of 100 (see the reconstructed command below), and top showed the same thing: only one process had ~100% CPU usage while the others stayed at ~0%, and the p99 latency increased to 390ms (versus 12ms p99 at concurrency 1 and 45ms p99 at concurrency 10).
I also ran the same test against seldon-core-microservice, and its multiprocessing really did work: CPU usage spread across processes and throughput improved (latency dropped). So I think there might be something wrong with MLServer.
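For reference, that heavier run presumably corresponds to the earlier ab invocation with the counts raised (a reconstruction, not copied from the thread):

ab -n 100000 -c 100 -T application/json -p sklearn-mlserver.json http://localhost:8080/v2/models/sklearn/infer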
Hey @ooooona ,
Thanks for providing those details.
Could you share more info about the type of requests you are sending? How large are they?
Deserialisation happens on the main process, so if these are large requests, that could be a potential bottleneck.
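As a rough way to gauge that cost (a sketch, not from the thread; it assumes the payload file used in the ab commands above), one can time deserialising the request body on its own:

import json
import timeit

with open("sklearn-mlserver.json", "rb") as f:
    body = f.read()
print(f"payload size: {len(body)} bytes")

# Average cost of deserialising one request body. If this grows to a
# noticeable fraction of the per-request latency, the main process
# (which does the deserialisation) could become the bottleneck.
per_call = timeit.timeit(lambda: json.loads(body), number=10_000) / 10_000
print(f"json.loads: {per_call * 1e6:.1f} µs per body")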
hi @adriangonz , sorry for my late reply. My requests are quite small:
$ cat sklearn-mlserver.json
{
  "inputs": [
    {
      "name": "args",
      "shape": [1, 4],
      "datatype": "FP32",
      "data": [10.1, 13.5, 1.4, 0.2]
    }
  ]
}
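For completeness, a single-request sanity check of the same payload (a sketch, not from the thread; it assumes the requests package is available):

import json
import requests

with open("sklearn-mlserver.json") as f:
    payload = json.load(f)

resp = requests.post(
    "http://localhost:8080/v2/models/sklearn/infer",
    json=payload,
)
print(resp.status_code, resp.json())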