optimum-intel
Create infer request per inference to enable concurrency
This is just a draft PR for now, to start a discussion.
It modifies forward() calls to create an inference request on every call instead of reusing a single one created along with the model. This way multiple inferences can run at the same time, allowing higher overall throughput.
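For illustration, here is a minimal sketch of the two behaviors using the plain OpenVINO Python API (not the actual optimum-intel code; the model path and device are placeholders):

```python
import openvino as ov

core = ov.Core()
compiled_model = core.compile_model("model.xml", "CPU")

# Current behavior: one shared request, created once along with the model.
# Concurrent forward() calls contend for this single request.
shared_request = compiled_model.create_infer_request()

def forward_shared(inputs):
    return shared_request.infer(inputs)

# Proposed behavior: a fresh request per forward() call, so several
# threads can run inference at the same time on their own requests.
def forward_per_call(inputs):
    request = compiled_model.create_infer_request()
    return request.infer(inputs)
```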
Why create a new inference request every time forward() is called? I think there's some overhead with that... not sure if it matters. Can't we just create a request when the model is compiled and keep it as a member variable?
@mzegla what is the example code for running in such "mode"? Does it require multiple objects/calls into optimum from separate threads? Do you have any specific benchmarks to back up the gains?
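(For reference, one way such a concurrent mode could be exercised from user code; this is an illustrative sketch only, not code from this PR, where pipe stands for any optimum-intel model/pipeline object and the prompts are placeholders:)

```python
from concurrent.futures import ThreadPoolExecutor

prompts = ["first input", "second input", "third input", "fourth input"]

def run(prompt):
    # Each call triggers its own forward() and, with this PR,
    # its own infer request, so the calls no longer serialize
    # on a single shared request.
    return pipe(prompt)

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(run, prompts))
```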
Originally the "compile model approach" was introduced to remove the overhead of creating multiple request instances (as @ngaloppo mentioned; here is the exact PR: https://github.com/huggingface/optimum-intel/pull/265). Right now it is not clear to me how your changes should be utilized in higher-level user code.
Another idea that comes to my mind is introducing a parallel_requests=True/False flag in either the __call__ or even the __init__ method of the model. This would allow pipelines to be adjusted to the expected behavior and let users decide which gains are more important -- sequential gains (one-time IR creation) or parallel gains (multiple requests on demand; BTW, even using AsyncInferQueue as a pool of requests could be considered here, but that depends on the scenario).
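A rough sketch of what that flag could look like (all names hypothetical; parallel_requests is not an existing optimum-intel argument, and the AsyncInferQueue usage follows the standard OpenVINO Python API):

```python
import openvino as ov

class OVModelSketch:
    """Hypothetical wrapper, not the actual OVModel implementation."""

    def __init__(self, compiled_model: ov.CompiledModel, parallel_requests: bool = False):
        self.compiled_model = compiled_model
        self.parallel_requests = parallel_requests
        # Sequential mode: reuse a single request created once (current behavior).
        self.request = None if parallel_requests else compiled_model.create_infer_request()

    def __call__(self, inputs):
        if self.parallel_requests:
            # Parallel mode: a fresh request per call, as proposed in this PR.
            return self.compiled_model.create_infer_request().infer(inputs)
        return self.request.infer(inputs)


# AsyncInferQueue alternative: a fixed-size pool of requests shared across calls.
def infer_with_queue(compiled_model, batch_of_inputs, jobs=4):
    results = {}
    queue = ov.AsyncInferQueue(compiled_model, jobs)
    queue.set_callback(lambda request, idx: results.__setitem__(idx, request.results))
    for i, inputs in enumerate(batch_of_inputs):
        queue.start_async(inputs, userdata=i)
    queue.wait_all()
    return [results[i] for i in range(len(batch_of_inputs))]
```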