
[REQUEST] Saving or Exporting `InferenceEngine`s to support model scaling in production

Open joehoover opened this issue 3 years ago • 5 comments

Is your feature request related to a problem? Please describe. I want to use DeepSpeed Inference in production and I am wondering whether there are suggested solutions for reducing the scaling latency introduced by init_inference(). It takes a considerable amount of time to initialize an inference engine and this will make it difficult to dynamically scale model instances.

Frankly, I assume there already is a solution, but I have not found a description in the documentation.

Describe the solution you'd like I want to reduce or eliminate the latency introduced by the deepspeed.init_inference() call that is used in the DS Inference tutorials. For example, is it possible to export/save an initialized inference engine?
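To make the request concrete: the naive version of "export/save an initialized inference engine" would be to serialize the engine object once and reload it at scale-up time. Below is a minimal sketch of that round-trip using a dummy stand-in class (hypothetical, not the real `InferenceEngine`, which holds CUDA state and injected kernels that plain pickling may not survive):

```python
import pickle


class DummyEngine:
    """Stand-in for an initialized inference engine (hypothetical).

    The real deepspeed.init_inference() performs kernel injection and
    weight-layout changes; here we just record a marker.
    """

    def __init__(self, model_name: str):
        self.model_name = model_name
        self.initialized = True

    def generate(self, prompt: str) -> str:
        return f"[{self.model_name}] {prompt}"


# "Expensive" one-time initialization, done once ahead of deployment.
engine = DummyEngine("gpt-j-6b")

# Serialize the initialized engine to disk (here, to bytes)...
blob = pickle.dumps(engine)

# ...and restore it later without paying the init cost again.
restored = pickle.loads(blob)
print(restored.generate("hello"))  # → [gpt-j-6b] hello
```

Whether the real engine supports this kind of serialization is exactly the open question of this issue.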

Describe alternatives you've considered I have not considered any alternatives, but I would be open to suggestions.

Additional context I am new to DeepSpeed and I do not know how the time requirements of inference engine initialization vary across model types and sizes. I was motivated to open this issue after testing the GPT-J inference kernels; I didn't time init_inference(), but it certainly took long enough to pose an obstacle to efficient scaling.

EDIT: initialization takes about 57 seconds on my system (AWS SageMaker ml.g4dn.12xlarge instance).
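For anyone who wants to reproduce a number like the ~57 s above, a simple timing harness around the init call is enough. A sketch follows, with `time.sleep` standing in for the real `deepspeed.init_inference()` call (which requires a GPU and is not assumed runnable here):

```python
import time


def build_engine():
    """Stand-in for deepspeed.init_inference(model, ...), the call whose
    latency we want to measure; sleep simulates the one-time init cost."""
    time.sleep(0.1)
    return object()


start = time.perf_counter()
engine = build_engine()
elapsed = time.perf_counter() - start

# With the real call, this printed ~57 s on an ml.g4dn.12xlarge for GPT-J.
print(f"init took {elapsed:.2f} s")
```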

joehoover avatar Jan 12 '22 16:01 joehoover

Hi @joehoover,

Thanks for bringing up this challenge. I will definitely look into this and share more information.

Best, Reza

RezaYazdaniAminabadi avatar Jan 13 '22 21:01 RezaYazdaniAminabadi

Hi @RezaYazdaniAminabadi ,

I'm interested in this request as well. Do you have any updates or information to share yet?

Best, Nico

naxty avatar Apr 01 '22 14:04 naxty

Any updates or workarounds on this? DeepSpeed provides great benefits for inference, but if loading a model takes over a minute, it defeats the purpose in production.
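Absent a DeepSpeed-native fix, one generic workaround pattern (a sketch, not DeepSpeed-specific, under the assumption that the init cost can only be paid up front) is a warm pool: pre-initialize a few engines at startup so that scaling events dequeue an already-built engine instead of calling `init_inference()` on the request path. The `build_engine` helper below is a hypothetical stand-in for that expensive call:

```python
import queue
import threading


def build_engine(model_name: str):
    """Stand-in for the expensive deepspeed.init_inference() call."""
    return {"model": model_name, "ready": True}


class WarmPool:
    """Pre-initializes engines in the background so scale-up events
    dequeue a ready engine instead of paying init latency per instance."""

    def __init__(self, model_name: str, size: int):
        self._model_name = model_name
        self._pool: queue.Queue = queue.Queue()
        # Build the initial pool concurrently, then wait until it is full.
        threads = [threading.Thread(target=self._fill) for _ in range(size)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    def _fill(self):
        self._pool.put(build_engine(self._model_name))

    def acquire(self):
        # Blocks only if demand outruns the warm pool.
        return self._pool.get()


pool = WarmPool("gpt-j-6b", size=2)
engine = pool.acquire()
print(engine["ready"])  # → True
```

This trades idle GPU memory for scale-up latency, which may or may not be acceptable depending on instance cost.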

joaopcm1996 avatar May 26 '23 10:05 joaopcm1996

Adding @lekurile to this conversation.

awan-10 avatar Jun 01 '23 18:06 awan-10

Is there any solution?

manmay-nakhashi avatar Jun 13 '23 15:06 manmay-nakhashi