LM-Infinite

Improved GPU memory usage but slower inference speed?

Open ys-zong opened this issue 2 years ago • 0 comments

Hi, thanks for the nice work! I tried to use the following code to enable LM-Infinite for Llama, following the README:

import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-chat-hf',
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    low_cpu_mem_usage=True,
)

from models.llama import convert_llama_model
model = convert_llama_model(model, 4096, 10)
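For context, `convert_llama_model` presumably switches the model to LM-Infinite's Λ-shaped attention, where each token attends only to a few initial "global" tokens plus a sliding local window rather than the full causal mask. The argument meanings below (local window = 4096, global tokens = 10) are my assumption from the call above; this is a minimal sketch in plain Python, not the repository's implementation:

```python
def lambda_shaped_mask(seq_len, n_global, window):
    """Boolean causal mask for Lambda-shaped attention (sketch).
    mask[q][k] is True when query position q may attend to key
    position k: k must be causal (k <= q) and either among the
    first n_global tokens or within the local window of q."""
    return [
        [k <= q and (k < n_global or q - k < window)
         for k in range(seq_len)]
        for q in range(seq_len)
    ]

# Each row has at most n_global + window True entries, so attention
# memory grows linearly in seq_len rather than quadratically -- which
# would explain the lower GPU memory usage reported here.
```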

and then do the inference as usual. The GPU memory usage is lower than with regular attention, but the inference speed becomes much slower (roughly 10x). I'm using an A100 GPU, and I checked GPU utilization: it's very low, around 10%. I wonder if you have any idea why this happens? Many thanks.
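To put a number on the slowdown, one way is to measure tokens per second for both the regular and the converted model. A minimal, hedged timing helper (on CUDA you would also call `torch.cuda.synchronize()` before and after the timed region so asynchronously queued kernels are counted):

```python
import time

def tokens_per_second(generate_fn, n_new_tokens):
    """Time one generation call and return new tokens per second.
    generate_fn is any zero-argument callable that produces
    n_new_tokens tokens when invoked (e.g. a lambda wrapping
    model.generate with max_new_tokens=n_new_tokens)."""
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed

# Stand-in workload for illustration; replace with the real
# model.generate call on your inputs:
rate = tokens_per_second(lambda: sum(range(100_000)), n_new_tokens=128)
```

Comparing the two rates at the same prompt length and `max_new_tokens` isolates whether the conversion itself, rather than some other setting, causes the 10x gap.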

ys-zong · Apr 16 '24, 22:04