LLaVA-NeXT
Issue with 4-bit Quantization for LLaVA-NeXT-Video-32B Model on A100-40GB GPU
Hello, I am trying to run the lmms-lab/LLaVA-NeXT-Video-32B-Qwen model on an A100-40GB GPU. However, I encounter an OOM issue when loading the model in its default configuration. To address this, I attempted to enable 4-bit quantization using the bitsandbytes library by modifying my script as follows:
from llava.model.builder import load_pretrained_model

pretrained = "lmms-lab/LLaVA-NeXT-Video-32B-Qwen"
model_name = "llava_qwen"
device_map = "auto"

# Load the model with 4-bit quantization enabled
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained,
    None,                 # no separate model base
    model_name,
    load_in_8bit=False,   # ensure 8-bit quantization is disabled
    load_in_4bit=True,    # enable 4-bit quantization
)
model.eval()
However, when I run the script, I encounter the following error message:
ValueError: .to is not supported for 4-bit or 8-bit bitsandbytes models.
Please use the model as it is, since the model has already been set to the correct devices and casted to the correct dtype.
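If it helps narrow this down: as far as I understand, this is the general transformers/bitsandbytes rule that a quantized model must not be moved with .to() after loading. The minimal sketch below reproduces the same ValueError in plain transformers with a small stand-in checkpoint (purely for illustration, this is not the LLaVA-NeXT builder):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B",               # small stand-in checkpoint, not the 32B model
    quantization_config=bnb_config,
    device_map="auto",               # bitsandbytes/accelerate place the weights
)

# model.to("cuda")  # uncommenting this raises the same ValueError about .to()
#                   # not being supported for 4-bit/8-bit bitsandbytes models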
Could you clarify how to properly enable 4-bit quantization for the lmms-lab/LLaVA-NeXT-Video-32B-Qwen model?
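For reference, this is roughly what I would expect a working call to look like, assuming load_pretrained_model forwards device_map and the quantization flag on to transformers' from_pretrained (the keyword names are my guess, so please correct me), with no model.to(...) afterwards:

tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained,
    None,
    model_name,
    device_map=device_map,   # let accelerate/bitsandbytes place the shards
    load_in_4bit=True,       # keep 4-bit quantization enabled
)
model.eval()  # no model.to(...) here; the quantized model should already be on the GPU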