LLaVA-NeXT
Issue with 4-bit Quantization for LLaVA-NeXT-Video-32B Model on A100-40GB GPU
Hello, I am trying to run the lmms-lab/LLaVA-NeXT-Video-32B-Qwen model on an A100-40GB GPU. However, I encounter an OOM issue when loading the model in its default configuration. To address this, I attempted to enable 4-bit quantization using the bitsandbytes library by modifying my script as follows:
from llava.model.builder import load_pretrained_model

pretrained = "lmms-lab/LLaVA-NeXT-Video-32B-Qwen"
model_name = "llava_qwen"
device_map = "auto"

# Load the model with 4-bit quantization enabled
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained,
    None,                 # no separate model base
    model_name,
    load_in_8bit=False,   # ensure 8-bit quantization is disabled
    load_in_4bit=True,    # enable 4-bit quantization
)
model.eval()
However, when I run the script, I encounter the following error message:
ValueError: .to is not supported for 4-bit or 8-bit bitsandbytes models.
Please use the model as it is, since the model has already been set to the correct devices and casted to the correct dtype.
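If it helps narrow this down: as far as I understand, this is the general transformers/bitsandbytes rule that a quantized model must not be moved with .to() after loading. The minimal sketch below reproduces the same ValueError in plain transformers with a small stand-in checkpoint (purely for illustration, this is not the LLaVA-NeXT builder):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B",               # small stand-in checkpoint, not the 32B model
    quantization_config=bnb_config,
    device_map="auto",               # bitsandbytes/accelerate place the weights
)

# model.to("cuda")  # uncommenting this raises the same ValueError about .to()
#                   # not being supported for 4-bit/8-bit bitsandbytes models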
Could you clarify how to properly enable 4-bit quantization for the lmms-lab/LLaVA-NeXT-Video-32B-Qwen model?
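For reference, this is roughly what I would expect a working call to look like, assuming load_pretrained_model forwards device_map and the quantization flag on to transformers' from_pretrained (the keyword names are my guess, so please correct me), with no model.to(...) afterwards:

tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained,
    None,
    model_name,
    device_map=device_map,   # let accelerate/bitsandbytes place the shards
    load_in_4bit=True,       # keep 4-bit quantization enabled
)
model.eval()  # no model.to(...) here; the quantized model should already be on the GPU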