
Failed to save the GPTQ-quantized weights of Qwen2 72B.

Open fzp0424 opened this issue 1 year ago • 4 comments

Reminder

  • [X] I have read the README and searched the existing issues.

System Info

Quantizing model.layers blocks : 100%|█████████████████████████████████████████████████████████████████████| 80/80 [1:01:40<00:00, 46.26s/it]
WARNING:optimum.gptq.quantizer:Found modules on cpu/disk. Using Exllama/Exllamav2 backend requires all the modules to be on GPU. Setting `disable_exllama=True`
/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/modeling_utils.py:4565: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
  warnings.warn(
[INFO|quantization_config.py:699] 2024-07-09 09:26:00,183 >> You have activated exllama backend. Note that you can get better inference speed using exllamav2 kernel by setting `exllama_config`.
07/09/2024 09:26:00 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
07/09/2024 09:26:00 - INFO - llamafactory.model.loader - all params: 2,492,735,488
[INFO|configuration_utils.py:472] 2024-07-09 09:26:00,219 >> Configuration saved in /home/zhaopengfeng/qwen-0706/checkpoint-372/gptq/config.json
[INFO|configuration_utils.py:769] 2024-07-09 09:26:00,220 >> Configuration saved in /home/zhaopengfeng/qwen-0706/checkpoint-372/gptq/generation_config.json
/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/modeling_utils.py:2525: UserWarning: Attempting to save a model with offloaded modules. Ensure that unallocated cpu memory exceeds the `shard_size` (5GB default)
  warnings.warn(
Traceback (most recent call last):
  File "/home/zhaopengfeng/anaconda3/envs/llama_factory/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/home/zhaopengfeng/LLaMA-Factory/src/llamafactory/cli.py", line 87, in main
    export_model()
  File "/home/zhaopengfeng/LLaMA-Factory/src/llamafactory/train/tuner.py", line 91, in export_model
    model.save_pretrained(
  File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2661, in save_pretrained
    shard = {tensor: state_dict[tensor] for tensor in tensors}
  File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2661, in <dictcomp>
    shard = {tensor: state_dict[tensor] for tensor in tensors}
NameError: free variable 'state_dict' referenced before assignment in enclosing scope
peft                              0.11.1   
torch                             2.3.0
llamafactory                      0.8.3.dev0

This is very urgent; I hope to find out the cause 😫

Reproduction

My script

### model
model_name_or_path: /home/zhaopengfeng/kddcup_repo/ckpt_output/qwen72b/all_0706/checkpoint-372/merge
template: qwen

### export
export_dir: /home/zhaopengfeng/qwen-0706/checkpoint-372/gptq
export_quantization_bit: 4
export_quantization_dataset: /home/zhaopengfeng/kddcup_repo/task-specific/gptq_data_dev2.json
export_size: 2
export_device: auto
export_legacy_format: false


The quantization itself ran all the way through; it fell over at the final save step.
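For anyone hitting the same crash: the UserWarning right before the traceback ("Attempting to save a model with offloaded modules") suggests that with `export_device: auto`, some quantized layers were placed on cpu/disk, and `save_pretrained` then fails while gathering the state dict. A minimal diagnostic sketch, assuming the model was loaded with a `device_map` so that transformers records the placement in `model.hf_device_map` (the helper name is mine, for illustration):

from collections import Counter

def summarize_placement(model):
    # Count how many modules landed on each device. Any 'cpu' or 'disk'
    # entries mean offloaded modules, which is exactly what the
    # "Attempting to save a model with offloaded modules" warning flags.
    placement = Counter(getattr(model, "hf_device_map", {}).values())
    print(dict(placement))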

Expected behavior

No response

Others

No response

fzp0424 avatar Jul 09 '24 09:07 fzp0424

Additional info: auto_gptq 0.7.1, transformers 4.42.2

GPTQ quantization works fine for Yi-34B; it fails for Qwen2 72B.

fzp0424 avatar Jul 09 '24 09:07 fzp0424

I set init_kwargs["max_memory"] = {0: "20GIB", 1: "20GIB", 2: "20GIB", 3: "20GIB", 'cpu': "250GIB"}. The machine has 251 GB of RAM and 4x 48 GB A40 GPUs.
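For reference, a minimal sketch of how such a max_memory map is passed when loading with automatic device placement (transformers with accelerate installed); the cap values are the poster's, and whether they leave enough headroom for a 72B GPTQ export is an open question:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/home/zhaopengfeng/kddcup_repo/ckpt_output/qwen72b/all_0706/checkpoint-372/merge",
    device_map="auto",
    # Per-device caps; modules that do not fit under these limits are
    # offloaded to CPU, which is what later triggers the offloaded-save warning.
    max_memory={0: "20GiB", 1: "20GiB", 2: "20GiB", 3: "20GiB", "cpu": "250GiB"},
)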

fzp0424 avatar Jul 09 '24 09:07 fzp0424

@fzp0424 Hello, has this issue been resolved?

glowwormX avatar Jul 18 '24 09:07 glowwormX

@fzp0424 Hello, has this issue been resolved?

Either move to a machine with more RAM and run the whole quantization process on the CPU (it will be slow), or use 4x A100 GPUs.
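Expressed against the export config from the reproduction above, the CPU route would be a sketch like this (assuming `export_device` accepts `cpu`, which keeps the model off the GPUs during quantization and saving, provided the machine's RAM can hold it):

### export
export_dir: /home/zhaopengfeng/qwen-0706/checkpoint-372/gptq
export_quantization_bit: 4
export_quantization_dataset: /home/zhaopengfeng/kddcup_repo/task-specific/gptq_data_dev2.json
export_size: 2
export_device: cpu
export_legacy_format: false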

fzp0424 avatar Jul 19 '24 03:07 fzp0424