
Create infer request per inference to enable concurrency

mzegla opened this issue 1 year ago · 2 comments

This is just a draft PR for now to start a discussion.

It modifies the forward calls to create an inference request on every call instead of reusing the single one created along with the model. This way, multiple inferences can run at the same time, allowing higher overall throughput.
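To make the discussed change concrete, here is a minimal sketch (not the actual optimum-intel diff) of the two patterns, using the OpenVINO Python API; the model path, input shapes, and thread count are placeholders:

```python
import numpy as np
import openvino as ov
from concurrent.futures import ThreadPoolExecutor

core = ov.Core()
compiled_model = core.compile_model("model.xml", "CPU")  # placeholder model path

# Current pattern: one request created with the model and reused, so
# concurrent forward() calls would contend for the same request.
shared_request = compiled_model.create_infer_request()

def forward_shared(inputs):
    return shared_request.infer(inputs)

# Pattern proposed in this PR: a fresh request per forward() call, so
# several calls can run at the same time from different threads.
def forward_per_call(inputs):
    request = compiled_model.create_infer_request()
    return request.infer(inputs)

# The kind of concurrency this enables (shapes are placeholders).
dummy = {0: np.random.rand(1, 3, 224, 224).astype(np.float32)}
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(forward_per_call, [dummy] * 8))
```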

mzegla avatar Dec 21 '23 10:12 mzegla

Why create a new inference request every time forward() is called? I think there's some overhead with that... not sure if it matters. Can't we just create a request when the model is compiled and keep it as a member variable?

ngaloppo avatar Jan 03 '24 19:01 ngaloppo

@mzegla what is the example code for running in such a "mode"? Does it require multiple objects/calls into optimum in separate threads? Do you have any specific benchmarks to back up the gains?

Originally, the "compile model approach" was introduced to remove the overhead of creating multiple instances of requests (as @ngaloppo mentioned; here is the exact PR: https://github.com/huggingface/optimum-intel/pull/265). Right now it is not clear to me how your changes should be utilized in higher-level user code.

Another idea that comes to mind is introducing a parallel_requests=True/False flag in either the __call__ or even the __init__ function of the model. This would allow pipelines to be adjusted to the expected behavior and to decide which gains are more important -- sequential gains (one-time IR creation) or parallel gains (multiple requests on demand; BTW, even the use of AsyncInferQueue as a pool of requests could be considered here, but that depends on the scenario). A rough sketch of the idea follows below.
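For illustration only, a rough sketch of what such a flag could look like; the ModelWrapper class and the parallel_requests/jobs parameters are hypothetical, not an existing optimum-intel API, while AsyncInferQueue is the real OpenVINO class used here as a request pool:

```python
import openvino as ov

class ModelWrapper:
    """Hypothetical sketch, not an existing optimum-intel class."""

    def __init__(self, model_path, device="CPU", parallel_requests=False, jobs=4):
        core = ov.Core()
        self.compiled_model = core.compile_model(model_path, device)
        self.parallel_requests = parallel_requests
        if parallel_requests:
            # AsyncInferQueue keeps a pool of `jobs` infer requests and
            # dispatches each start_async() call to an idle one.
            self.infer_queue = ov.AsyncInferQueue(self.compiled_model, jobs)
            self.outputs = []
            self.infer_queue.set_callback(
                lambda request, userdata: self.outputs.append(request.results)
            )
        else:
            # Single request created once along with the model, as today.
            self.request = self.compiled_model.create_infer_request()

    def __call__(self, inputs):
        if self.parallel_requests:
            # Non-blocking submit; caller synchronizes with wait_all() later.
            self.infer_queue.start_async(inputs)
            return None
        return self.request.infer(inputs)

    def wait_all(self):
        if self.parallel_requests:
            self.infer_queue.wait_all()
        return self.outputs
```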

jiwaszki avatar Jan 05 '24 06:01 jiwaszki