
Authors, any suggestions on why inference speed is slow?


The inference time of your model is around 5-15 seconds for a single image-text pair question, which is very slow compared to the concurrent work LLaVA, which takes only a few seconds. Do you know why this could happen, and do you have any suggestions for future researchers to mitigate this issue? The only difference during inference is that your model has a Q-Former; could that be why it's so slow?

gordonhu608 · Apr 23 '23

Hi, do you have any idea?

igodogi · May 12 '23

I think it has to do with the use of 8-bit quantization for LLaMA text generation. By default, MiniGPT-4 now uses bitsandbytes' load_in_8bit to reduce the VRAM footprint. However, at least in my experience, this results in slower generation with many different LLMs, including LLaMA. For normal text generation I've observed speeds roughly half those of normal FP16 or 4-bit GPTQ models, and in the case of MiniGPT-4, running it without 8-bit brings generation times for detailed descriptions down to 2-5 seconds (depending on output length), versus 20+ seconds with it.

This may also be partly due to inefficient processing of the image embeddings (I'm not sure), but if you're concerned with generation speed and have a GPU with at least 24GB of VRAM, you can run this without 8-bit by using the 7B version and setting low_resource to False in the minigpt4_eval.yaml config file located in the eval_configs folder.
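For reference, here is a minimal sketch of the two loading paths that the low_resource flag presumably switches between, using the standard Hugging Face transformers + bitsandbytes API. The checkpoint path and device placement are placeholders, not MiniGPT-4's exact code:

```python
# Hedged sketch: how 8-bit vs. FP16 loading is typically requested through
# Hugging Face transformers + bitsandbytes. Checkpoint path is a placeholder.
import torch
from transformers import LlamaForCausalLM

LLAMA_PATH = "path/to/vicuna-7b"  # placeholder; point at your local weights

# low_resource=True style: 8-bit weights, smaller VRAM footprint, slower generation.
model_8bit = LlamaForCausalLM.from_pretrained(
    LLAMA_PATH,
    torch_dtype=torch.float16,
    load_in_8bit=True,       # bitsandbytes int8 quantization
    device_map={"": 0},      # keep everything on GPU 0
)

# low_resource=False style: plain FP16 weights, roughly 2x faster generation in
# my experience, but needs ~24GB of VRAM alongside the ViT and Q-Former.
model_fp16 = LlamaForCausalLM.from_pretrained(
    LLAMA_PATH,
    torch_dtype=torch.float16,
).cuda()
```

Timing model.generate on the same prompt with both variants should be enough to reproduce the rough 2x gap described above.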

Ideally, I'd like to run this in 4-bit, which would allow even the 13B version to run on a card with less than 24GB of VRAM, but the only mention of this I've seen is #13, and the person who claimed to have made it work never elaborated on how they did it. I plan to try to get it working myself soon, and I'll report back if I'm successful.
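In case it helps whoever tries this next, here is a rough sketch of what generic 4-bit (NF4) loading looks like with recent transformers/bitsandbytes versions. This is just the standard API, not a tested MiniGPT-4 integration; the visual encoder and Q-Former would still need to be wired up around it:

```python
# Hedged sketch: generic 4-bit (NF4) loading via transformers + bitsandbytes.
# Requires reasonably recent versions of both libraries; untested with MiniGPT-4.
import torch
from transformers import BitsAndBytesConfig, LlamaForCausalLM

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization
    bnb_4bit_compute_dtype=torch.float16,   # do matmuls in FP16
)

model_4bit = LlamaForCausalLM.from_pretrained(
    "path/to/vicuna-13b",    # placeholder path; 13B should fit well under 24GB in 4-bit
    quantization_config=bnb_config,
    device_map={"": 0},
)
```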

Unintensored · May 24 '23

Can confirm the comment above from @Unintensored. Got it running with low_resource=False on a Hugging Face Nvidia A10G large instance. (But not on the A10G small, which threw errors during build; both have 24GB of VRAM, but the large has 46GB of RAM vs 15GB, and 12 vCPUs vs 4.)

richkirsch · May 24 '23