
Does Zero-Inference support TP?

Open preminstrel opened this issue 1 year ago • 11 comments

preminstrel avatar Apr 16 '24 15:04 preminstrel

ZeRO-Inference is composable with Megatron-style TP. That is, the TP is implemented on the client side.
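A minimal sketch of what the client-side setup looks like (an assumption, not from this thread: a HuggingFace-style PyTorch model, and config keys following the DeepSpeed ZeRO stage-3 schema). ZeRO-Inference is essentially ZeRO stage 3 with parameter offload, so any Megatron-style TP slicing must already have been applied to the model before this point:

```python
# Hypothetical sketch: ZeRO-Inference = ZeRO stage 3 + parameter offload.
# TP slicing (if any) is the client's job, done before handing the model in.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

def init_zero_inference(model):
    import deepspeed  # requires the deepspeed package

    # deepspeed.initialize returns (engine, optimizer, dataloader, scheduler);
    # only the engine is needed for inference
    engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
    engine.module.eval()
    return engine
```

The model passed in here may already be TP-sliced by the client (e.g., by Megatron); ZeRO-Inference itself does not reshard it.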

tjruwase avatar Apr 16 '24 16:04 tjruwase

Hello, is it composable with KV cache offloading? I cannot find an API for it... @tjruwase Thanks!

preminstrel avatar Apr 16 '24 16:04 preminstrel

I mean, only offloading the KV cache while keeping all of the model weights on the GPUs. All the example code looks like it is for a single GPU.

preminstrel avatar Apr 16 '24 16:04 preminstrel

I assume you are referring to KV cache offloading in the latest ZeRO-Inference. We did not evaluate it with TP, but I expect it should work.

tjruwase avatar Apr 16 '24 16:04 tjruwase

Thanks! But how can I make it work? Do you have an example command?

preminstrel avatar Apr 16 '24 16:04 preminstrel

I tried setting num_gpus to 2, but it seems to create two identical copies of the model, one on each GPU.

preminstrel avatar Apr 16 '24 16:04 preminstrel

This is because your model has not been pre-processed by a TP framework like Megatron. ZeRO-Inference does not perform TP slicing on any model.

tjruwase avatar Apr 16 '24 16:04 tjruwase

> Thanks! But how can I make it work? Do you have an example command?

Below are commands for single-GPU inference with KV cache offload: https://github.com/microsoft/DeepSpeedExamples/tree/master/inference/huggingface/zero_inference#token-generation-with-zero-inference
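For reference, the linked README runs commands along these lines (illustrative only: the model name and batch/sequence sizes are placeholders, and the `--cpu-offload`/`--kv-offload` flag names should be checked against `run_model.py --help` in the repo, as they may have changed):

```shell
# Single-GPU ZeRO-Inference with the KV cache offloaded to CPU
# (placeholder model and sizes; verify flags against the repo's run_model.py)
deepspeed --num_gpus 1 run_model.py \
  --model bigscience/bloom-7b1 \
  --batch-size 8 --prompt-len 512 --gen-len 32 \
  --cpu-offload --kv-offload
```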

tjruwase avatar Apr 16 '24 16:04 tjruwase

Yes, you are right! Thanks! And the performance of single-GPU inference with KV cache offload is really nice! But I have a question:

I found that the fork of transformers actually allocates the full buffer for the KV cache, which seems incompatible with TP. It will still allocate all self.num_heads for the KV cache on each GPU.
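To make the concern concrete (hypothetical shapes, not taken from the thread): under Megatron-style TP, each rank should hold only `num_heads / tp_size` KV heads, so a cache allocated with the full `num_heads` on every rank is replicated rather than sharded:

```python
def kv_cache_bytes(num_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x for keys and values; fp16 by default (2 bytes per element)
    return 2 * batch * num_heads * seq_len * head_dim * dtype_bytes

# Hypothetical 7B-class shapes
num_heads, head_dim, seq_len, batch, tp = 32, 128, 2048, 8, 2

# Full num_heads on each GPU (what the fork allocates): 256 MiB per GPU
full = kv_cache_bytes(num_heads, head_dim, seq_len, batch)

# Properly TP-sharded cache (num_heads / tp per rank): 128 MiB per GPU
sharded = kv_cache_bytes(num_heads // tp, head_dim, seq_len, batch)
```

With the full allocation, doubling the GPU count leaves the per-GPU KV cache footprint unchanged, which defeats the purpose of sharding it.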

So basically there is no official implementation of TP + ZeRO-Inference + KV offload that I can run directly. Please correct me if I am wrong.

Are you planning to add this feature in the future? Btw, will TP help under this setting? (since the attn computation is all on CPU anyway)

Thanks!

preminstrel avatar Apr 16 '24 17:04 preminstrel

Glad that the KV cache offload performance is good for your scenario.

Yes, you are correct: there is no official implementation of TP + ZeRO-Inference + KV offload. Unfortunately, we don't have the bandwidth for this right now, but we welcome community contributions.

Yes, I agree that TP won't add much benefit to KV offload since (1) the memory pressure is mostly relieved already, and (2) the attn computation is on CPU.

tjruwase avatar Apr 16 '24 18:04 tjruwase

Thank you very much! Nice work!

preminstrel avatar Apr 16 '24 18:04 preminstrel