MLServer
Bad performance with long requests
Hello!
I've been testing different ways of making requests, and I've noticed that the performance gets noticeably worse when using requests with many inputs.
For example, I have a pandas DataFrame with 39 columns. I can encode it with the PandasCodec, resulting in a request with 39 inputs, or I can encode it in base64, resulting in a request with only one input.
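To make the comparison concrete, this is roughly what the two request shapes look like (the DataFrame and the CSV-based base64 serialisation below are illustrative, not my exact setup):

```python
import base64

import pandas as pd
from mlserver.codecs import PandasCodec
from mlserver.types import InferenceRequest, RequestInput

# Illustrative DataFrame with 39 columns
df = pd.DataFrame({f"col_{i}": range(1000) for i in range(39)})

# Option A: PandasCodec -> one request input per column (39 inputs)
multi_input = PandasCodec.encode_request(df)
assert len(multi_input.inputs) == 39

# Option B: serialise the whole frame up front and send it as a single
# base64-encoded BYTES input (CSV is just one possible serialisation;
# decoding then happens inside the model's predict())
blob = base64.b64encode(df.to_csv(index=False).encode()).decode()
single_input = InferenceRequest(
    inputs=[
        RequestInput(name="dataframe", datatype="BYTES", shape=[1], data=[blob])
    ]
)
assert len(single_input.inputs) == 1
```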
Surprisingly, the performance is much better with single-input requests. With a relatively demanding model, the performance is about 30% better with base64 encoding.
Unfortunately, I cannot share the model or the concrete requests, but I have tried to reproduce the example as closely as possible in this repository, using requests that are as similar as possible but without any actual model. In this example, you can see very clearly how the size of the requests affects the performance.
https://github.com/pablobgar/test-mlserver
I am using adaptive batching. Could it be related to the request batching process, or do you have any idea why there could be such a difference in performance?
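For reference, the adaptive batching configuration is just the standard `model-settings.json` knobs (the values and model name below are illustrative, not my exact settings):

```json
{
  "name": "my-model",
  "implementation": "models.MyModel",
  "max_batch_size": 32,
  "max_batch_time": 0.5
}
```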
Thanks for your help.
Hey @pablobgar ,
This is great! Thanks a lot for doing that initial research and sharing a reproducible example.
I haven't looked much into it yet, but one potential cause could be the parsing / serialisation itself. Adding more inputs is quite verbose, so the request ends up as a much larger JSON blob, which may explain the difference. One quick test we could run to rule this one out would be to send the same request through gRPC (where serialisation and parsing are way faster).
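Something along these lines should work for a quick check (a rough sketch, assuming MLServer's default gRPC port 8081, the dataplane stubs bundled with the `mlserver` package, and a hypothetical model / input name):

```python
import grpc
from mlserver.grpc import dataplane_pb2 as pb
from mlserver.grpc import dataplane_pb2_grpc as pb_grpc

channel = grpc.insecure_channel("localhost:8081")
stub = pb_grpc.GRPCInferenceServiceStub(channel)

# Same V2 inference payload, but protobuf-encoded instead of JSON
request = pb.ModelInferRequest(
    model_name="my-model",  # hypothetical model name
    inputs=[
        pb.ModelInferRequest.InferInputTensor(
            name="col_0",
            datatype="FP32",
            shape=[3],
            contents=pb.InferTensorContents(fp32_contents=[1.0, 2.0, 3.0]),
        )
    ],
)
response = stub.ModelInfer(request)
print(response.model_name)
```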
Besides that, the only thing I can think of would be some kind of inefficient loop through the inputs. That would require profiling the code though.
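If it helps, a sampling profiler like py-spy can attach to the running server process without any code changes (e.g. `py-spy record --pid <PID> -o profile.svg`), which should show whether the time is going into the batching loop or into serialisation.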
Hey @adriangonz,
I have added some k6 scripts to test the same scenario with gRPC. As far as I can see, the differences are even bigger than with REST, and bigger still when adaptive batching is enabled.