
How to quantize NVILA with AWQ?


```
(vila) kirdo@kirdo-System-Product-Name:~/LLM/llm-awq$ python -m awq.entry --model_path /home/kirdo/LLM/NVILA-8B-Video/ --w_bit 4 --q_group_size 128 --run_awq --dump_awq awq_cache/$MODEL-w4-g128.pt
Quantization config: {'zero_point': True, 'q_group_size': 128}
* Building model /home/kirdo/LLM/NVILA-8B-Video/
[2024-12-30 19:26:13,027] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Loading checkpoint shards: 100%|████████████████████████████████████████| 4/4 [00:03<00:00, 1.05it/s]
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
Repo card metadata block was not found. Setting CardData to empty.
Token indices sequence length is longer than the specified maximum sequence length for this model (57053 > 16384). Running this sequence through the model will result in indexing errors
* Split into 59 blocks
Traceback (most recent call last):
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/kirdo/LLM/llm-awq/awq/entry.py", line 352, in <module>
    main()
  File "/home/kirdo/LLM/llm-awq/awq/entry.py", line 293, in main
    model, enc = build_model_and_enc(args.model_path)
  File "/home/kirdo/LLM/llm-awq/awq/entry.py", line 199, in build_model_and_enc
    awq_results = run_awq(
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/kirdo/LLM/llm-awq/awq/quantize/pre_quant.py", line 136, in run_awq
    model.llm(samples.to(next(model.parameters()).device))
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1164, in forward
    outputs = self.model(
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 871, in forward
    position_embeddings = self.rotary_emb(hidden_states, position_ids)
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/kirdo/miniconda3/envs/vila/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 163, in forward
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
```

I just followed https://github.com/mit-han-lab/llm-awq.git to quantize NVILA-8B-Video, but when running the AWQ search, I got the error shown above.

Kibry-spin · Dec 30 '24

Looks like some components of the model were allocated on CPU unexpectedly. @Louym Could you please take a look at your convenience? Thanks.
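
As a generic debugging aid (not part of llm-awq; the `report_devices` helper below is hypothetical), one could print which top-level submodules of the built model ended up on CPU before the AWQ search runs, and move everything to GPU as the Flash Attention warning in the log suggests:

```python
import torch

def report_devices(model: torch.nn.Module) -> None:
    # Print the set of devices each top-level submodule's parameters live on,
    # e.g. the LLM, vision tower, and projector of a multimodal model.
    for name, child in model.named_children():
        devices = {str(p.device) for p in child.parameters()}
        print(f"{name}: {devices}")

# Usage sketch, assuming `model` is the object returned by the model builder
# before run_awq is called:
#   report_devices(model)
#   model = model.to("cuda")  # put all components on the same device
```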

ys-2020 · Jan 08 '25

Thank you for reaching out. It seems a bit unusual. If you're looking to obtain the quantized weights or AWQ scales of the LLM part, you might need to append /llm to the --model_path. You should also use --vila-20 for NVILA models. You can refer to the instructions here to quantize NVILA easily.
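
For example, adapting the command from the original post, the adjusted invocation might look like the following (an untested sketch; the output filename is a placeholder, and the exact flags should be taken from the llm-awq instructions):

```
python -m awq.entry \
    --model_path /home/kirdo/LLM/NVILA-8B-Video/llm \
    --w_bit 4 --q_group_size 128 \
    --run_awq --dump_awq awq_cache/nvila-8b-video-llm-w4-g128.pt \
    --vila-20
```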

I’ve tested our commands, and they appear to work without any errors. Could you provide more details about the issue you’re encountering?

Louym · Jan 10 '25

@Kibry-spin does your issue still persist? The AWQ developers @Louym @ys-2020 are willing to help if you still have questions.

Lyken17 · Mar 19 '25