
Queue management from 1.0.0 to 1.1.0 versions


Good morning! I noticed that you changed the request queue management in the REST server between versions 1.0.0 and 1.1.0, adding Python queues in the latter. I would like to know how request management works within the REST server in version 1.0.0. As far as I could see, is the operating system in charge of this task? Could you explain in more detail?

Thanks in advance.

alvarorsant avatar Sep 21 '22 10:09 alvarorsant

Hey @alvarorsant ,

Just to make sure I understand, are you referring to the queue used to feed requests into the parallel worker pool?

If that's the case, previously we were using the ProcessPoolExecutor, which is part of the standard library. However, under the hood, it also uses regular Python queues (you can check the actual implementation here: https://github.com/python/cpython/blob/3.10/Lib/concurrent/futures/process.py).
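To make the comparison concrete, here is a minimal sketch of how requests flow through a ProcessPoolExecutor: each submit() puts a work item on the executor's internal multiprocessing queue, which feeds the worker processes. The predict function here is just an illustrative stand-in for a model's inference call, not MLServer code.

```python
from concurrent.futures import ProcessPoolExecutor

def predict(payload):
    # Illustrative stand-in for a model's inference call
    return {"inputs": payload, "outputs": payload * 2}

if __name__ == "__main__":
    # submit() enqueues the work item; a worker process picks it up
    # from the executor's internal queue and runs predict on it.
    with ProcessPoolExecutor(max_workers=2) as pool:
        future = pool.submit(predict, 21)
        print(future.result())  # {'inputs': 21, 'outputs': 42}
```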

In MLServer 1.1.0, we got rid of the ProcessPoolExecutor and instead created our own management layer. The main reason for this change was that we wanted to move from having a separate pool per model to having a shared pool where all models are loaded. Since we then had a shared pool, we also needed to take care of loading / unloading the right models dynamically on those workers, so ProcessPoolExecutor just wasn't enough for our needs.
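Roughly speaking, a ProcessPoolExecutor worker can only run a fixed callable, whereas a shared-pool worker must also react to model lifecycle messages. The sketch below illustrates that kind of control loop; the message names and structure are hypothetical, not MLServer's actual implementation.

```python
import multiprocessing as mp

def worker(inbox, outbox):
    # Hypothetical control loop for a shared worker: it handles
    # load/unload messages as well as inference requests.
    models = {}
    while True:
        msg = inbox.get()
        if msg is None:  # sentinel: shut the worker down
            break
        if msg["op"] == "load":
            # Stand-in for deserialising the real model artefact
            models[msg["model"]] = lambda x: x * 2
            outbox.put({"op": "load", "model": msg["model"], "ok": True})
        elif msg["op"] == "unload":
            models.pop(msg["model"], None)
            outbox.put({"op": "unload", "model": msg["model"], "ok": True})
        elif msg["op"] == "infer":
            result = models[msg["model"]](msg["payload"])
            outbox.put({"op": "infer", "result": result})

if __name__ == "__main__":
    inbox, outbox = mp.Queue(), mp.Queue()
    proc = mp.Process(target=worker, args=(inbox, outbox))
    proc.start()
    inbox.put({"op": "load", "model": "sum-model"})
    inbox.put({"op": "infer", "model": "sum-model", "payload": 3})
    print(outbox.get())  # load acknowledgement
    print(outbox.get())  # inference result
    inbox.put(None)
    proc.join()
```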

The overall mechanics of both ProcessPoolExecutor and our new InferencePool implementation should be quite similar when it comes to request management though. That is, in both cases you have a Queue object used to communicate request payloads between the main and child processes.
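That shared mechanic can be sketched with plain multiprocessing queues: a request queue carries payloads from the main process to a worker, and a response queue carries results back. The payload shape here is illustrative only.

```python
import multiprocessing as mp

def worker(requests, responses):
    # Consume request payloads until a sentinel (None) arrives
    while True:
        payload = requests.get()
        if payload is None:
            break
        responses.put({"id": payload["id"], "result": payload["value"] * 2})

if __name__ == "__main__":
    requests, responses = mp.Queue(), mp.Queue()
    child = mp.Process(target=worker, args=(requests, responses))
    child.start()
    requests.put({"id": "req-1", "value": 5})
    print(responses.get())  # {'id': 'req-1', 'result': 10}
    requests.put(None)  # stop the worker
    child.join()
```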

Also, for full context, the main motivation to move to a shared pool was to reduce the memory overhead that each pool was introducing (which was raised in #434). Previously we were seeing massive increases in memory usage each time a new model was loaded. This was mainly due to the overhead of initialising that model's own pool (through ProcessPoolExecutor). After moving to a single shared pool, this extra memory usage has been massively reduced.

adriangonz avatar Sep 21 '22 11:09 adriangonz