
Quantization instructions

Open lucasjinreal opened this issue 1 year ago • 3 comments

Hello, increasingly, LLMs are shipping INT4 quantized weights, enabling more users to run these models even without a high-end GPU. May I ask whether there are plans to add INT4 quantization support for this excellent MLLM? I genuinely look forward to it, and I believe many users would benefit greatly from this feature.

lucasjinreal avatar Feb 01 '24 02:02 lucasjinreal

This section of the README gives some information about how to load the model in 4-bit or 8-bit quantized format: https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#launch-a-model-worker-4-bit-8-bit-inference-quantized It should be as simple as appending `--load-4bit` to the command-line arguments for any of the examples that use the command line.
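For example, the CLI inference command from the README would look roughly like this with the flag appended (the model path and image URL here are just the README's illustration; substitute the checkpoint you actually want to run):

```
python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file "https://llava-vl.github.io/static/images/view.jpg" \
    --load-4bit
```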

For other uses in the README, such as under Quick Start With HuggingFace, you will need to edit the Python files so that they load the model with the desired quantization, in the same way cli.py does.

This is from cli.py, which uses the command-line arguments to specify the quantization to load in:

```python
tokenizer, model, image_processor, context_len = load_pretrained_model(
    args.model_path, args.model_base, model_name,
    args.load_8bit, args.load_4bit, device=args.device
)
```
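For context, `args.load_8bit` and `args.load_4bit` come from argparse flags defined in cli.py, roughly along these lines (paraphrased, so check your checkout for the exact code):

```python
# Sketch of the relevant argparse flags in llava/serve/cli.py.
parser.add_argument("--load-8bit", action="store_true")
parser.add_argument("--load-4bit", action="store_true")
args = parser.parse_args()
```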

This is from run_llava.py, which currently does not use the command-line arguments to specify the quantization to load in:

```python
model_name = get_model_name_from_path(args.model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(
    args.model_path, args.model_base, model_name
)
```

Loading in 4-bit, for example, should be as easy as changing those lines in run_llava.py to:

```python
model_name = get_model_name_from_path(args.model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(
    args.model_path, args.model_base, model_name, False, True
)
```
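For readability, the same change can also be written with keyword arguments. This is just a sketch; the keyword names (`load_8bit`, `load_4bit`) should be checked against `load_pretrained_model`'s signature in llava/model/builder.py in your checkout:

```python
# Sketch: run_llava.py with explicit quantization flags.
# load_8bit / load_4bit are assumed keyword names; verify against
# load_pretrained_model in llava/model/builder.py.
model_name = get_model_name_from_path(args.model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(
    args.model_path,
    args.model_base,
    model_name,
    load_8bit=False,
    load_4bit=True,  # 4-bit (bitsandbytes) loading
)
```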

JoeySalmons avatar Feb 01 '24 04:02 JoeySalmons

AWQ allows the model to be saved once it has been quantized, so users only need to download the INT4 weights. When I say users shouldn't need a high-end GPU, I mean that many of us can't even load such a giant model in order to quantize it ourselves. In other words, could an INT4 model be provided, or could @TheBloke at least be asked to do this?
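For reference, a one-time quantize-and-save flow with the AutoAWQ library looks roughly like the sketch below. This is a generic example for a causal LLM, not a confirmed recipe for LLaVA: the model path, output directory, and quant_config values are illustrative, and LLaVA's vision tower and projector may need separate handling.

```python
# Sketch only: generic AutoAWQ quantize-and-save flow for a causal LM.
# Whether this applies directly to LLaVA's multimodal checkpoints is unverified.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/original-fp16-model"   # illustrative
quant_path = "path/to/awq-int4-output"       # illustrative
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Runs AWQ calibration and quantizes the weights to 4-bit.
model.quantize(tokenizer, quant_config=quant_config)

# Saves the INT4 weights so end users can download them directly.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```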

lucasjinreal avatar Feb 01 '24 06:02 lucasjinreal

Has anyone been able to verify that, for example, bitsandbytes, BetterTransformer, or Flash Attention 2 work with the new version 1.6 LLaVA models?

BBC-Esq avatar Feb 02 '24 00:02 BBC-Esq