DeepSpeed
[REQUEST] Saving or Exporting `InferenceEngine`s to support model scaling in production
Is your feature request related to a problem? Please describe.
I want to use DeepSpeed Inference in production and I am wondering whether there are suggested solutions for reducing the scaling latency introduced by `init_inference()`. It takes a considerable amount of time to initialize an inference engine, and this will make it difficult to dynamically scale model instances.
Frankly, I assume there already is a solution, but I have not found a description in the documentation.
Describe the solution you'd like
I want to reduce or eliminate the latency introduced by the `deepspeed.init_inference()` call that is used in the DS Inference tutorials. For example, is it possible to export/save an initialized inference engine?
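For reference, here is a minimal sketch of the kind of call this request is about, roughly following the DS Inference tutorials. The model name and argument values are illustrative assumptions, not the author's exact setup, and argument names may differ across DeepSpeed versions.

```python
# Sketch of the init_inference() call whose latency this issue is about.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# GPT-J is the model mentioned later in this issue; any HF causal LM works here.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

# Wrap the model in a DeepSpeed InferenceEngine with kernel injection.
# This is the slow step that the issue asks to save/export or otherwise amortize.
engine = deepspeed.init_inference(
    model,
    mp_size=1,                        # model-parallel degree
    dtype=torch.half,                 # run in fp16
    replace_with_kernel_inject=True,  # inject DeepSpeed's fused inference kernels
)
```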
Describe alternatives you've considered
I have not considered any alternatives, but I would be open to suggestions.
Additional context
I am new to DeepSpeed and I do not know how the time requirements of inference engine initialization vary across model types and sizes. I was motivated to open this issue after testing the GPT-J inference kernels. I didn't time `init_inference()`, but it certainly took long enough to pose an obstacle for efficient scaling.
EDIT: initialization takes about 57 seconds on my system (an AWS SageMaker ml.g4dn.12xlarge instance).
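A rough way to reproduce a number like this is to time the initialization step on its own. This is an illustrative sketch only; the model is an assumption, not the author's exact script.

```python
# Time deepspeed.init_inference() in isolation (illustrative sketch).
import time
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

start = time.time()
engine = deepspeed.init_inference(model, dtype=torch.half,
                                  replace_with_kernel_inject=True)
print(f"deepspeed.init_inference() took {time.time() - start:.1f} s")
```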
Hi @joehoover,
Thanks for bringing up this challenge. I will definitely look into this and share more information.
Best, Reza
Hi @RezaYazdaniAminabadi,
I'm interested in this request as well. Do you have any updates or information to share yet?
Best, Nico
Any updates or workarounds on this? DeepSpeed provides great benefits for inference, but if loading a model takes over a minute, it defeats the purpose in production.
Adding @lekurile to this conversation.
Is there any solution?