LLaMA-Factory
fsdp+qlora mixtral 8x22B: RuntimeError: Only Tensors of floating point and complex dtype can require gradients
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
I'm using the latest llmtuner 0.7.0 with the following library versions: transformers>=4.39.1, accelerate>=0.28.0, bitsandbytes>=0.43.0, and I get this error:
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/llmtuner/model/loader.py", line 128, in load_model model = AutoModelForCausalLM.from_pretrained(**init_kwargs) File "/opt/conda/envs/ptca/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained return model_class.from_pretrained( File "/opt/conda/envs/ptca/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3677, in from_pretrained ) = cls._load_pretrained_model( File "/opt/conda/envs/ptca/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4104, in _load_pretrained_model new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model( File "/opt/conda/envs/ptca/lib/python3.10/site-packages/transformers/modeling_utils.py", line 895, in _load_state_dict_into_meta_model value = type(value)(value.data.to("cpu"), **value.dict) File "/opt/conda/envs/ptca/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 568, in new obj = torch.Tensor._make_subclass(cls, data, requires_grad) RuntimeError: Only Tensors of floating point and complex dtype can require gradients
I have checked the issue https://github.com/hiyouga/LLaMA-Factory/issues/3206, but it doesn't help.
Expected behavior
No response
System Info
No response
Others
No response
provide your training scripts
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file ../../accelerate/fsdp_config.yaml \
    ../../../src/train_bash.py \
    --stage sft \
    --do_train \
    --do_eval \
    --model_name_or_path model_path \
    --dataset sample_dataset \
    --dataset_dir data \
    --template mistral \
    --finetuning_type lora \
    --lora_target all \
    --output_dir output_dir \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 16000 \
    --preprocessing_num_workers 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --warmup_steps 1 \
    --save_steps 100 \
    --eval_steps 1 \
    --evaluation_strategy steps \
    --learning_rate 5e-5 \
    --num_train_epochs 2 \
    --max_samples 1000 \
    --val_size 0.2 \
    --quantization_bit 8 \
    --fp16 \
    --upcast_layernorm
```
I tried Mixtral 8x7B too and got the same error.
FSDP+QLoRA only accepts 4-bit quantization.
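For reference, a minimal sketch of the 4-bit variant of the launch above (untested; it reuses the same placeholder paths and FSDP accelerate config from the original command, changing only `--quantization_bit` to 4 and dropping flags not relevant to the point):

```bash
# Same FSDP launch as above, but with 4-bit quantization as required for FSDP+QLoRA.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
    --config_file ../../accelerate/fsdp_config.yaml \
    ../../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path model_path \
    --dataset sample_dataset \
    --dataset_dir data \
    --template mistral \
    --finetuning_type lora \
    --lora_target all \
    --output_dir output_dir \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 5e-5 \
    --num_train_epochs 2 \
    --quantization_bit 4 \
    --fp16
```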
@hiyouga Got it. Is there any way to use multiple GPUs on a single node to do 8-bit QLoRA?
8-bit QLoRA only supports DDP.
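A rough sketch of what a single-node DDP launch for 8-bit QLoRA might look like (an assumption on my part: torchrun is used in place of the FSDP accelerate launch so the Trainer falls back to DDP; flags are copied from the command above and untested):

```bash
# One process per GPU via torchrun; no FSDP config is passed, so the Trainer uses DDP.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node 8 \
    ../../../src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path model_path \
    --dataset sample_dataset \
    --dataset_dir data \
    --template mistral \
    --finetuning_type lora \
    --lora_target all \
    --output_dir output_dir \
    --per_device_train_batch_size 1 \
    --learning_rate 5e-5 \
    --num_train_epochs 2 \
    --quantization_bit 8 \
    --fp16
```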
I am using QLoRA with 4-bit quantization, but somehow I get the same error.
For more detail, this is the config I used:
```
BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_quant_storage": "bfloat16",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}
```
And here is a snippet of the code responsible for the error:

```python
model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    quantization_config=bnb_config,
    trust_remote_code=True,
    attn_implementation="flash_attention_2" if args.use_flash_attn else "eager",
    torch_dtype=torch_dtype,
)
```
I am 100% sure it is this quantization config that produces the same error as the original poster. Switching to 4-bit didn't change anything in my case.