Fine-tuning Vicuna-7B with Local GPUs: RuntimeError: Expected is_sm80 || is_sm90 to be true, but got false
RuntimeError: Expected is_sm80 || is_sm90 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
0%| | 0/3096 [00:00<?, ?it/s]
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
I think this is a problem with PyTorch. I solved it by installing PyTorch built with CUDA 11.8 (not 12). If you have CUDA 12, you will also hit problems installing the flash-attention package, since PyTorch is not compiled against CUDA 12.0.
@Ejafa what is your PyTorch version? Mine is 2.0.0+cu117; also, wandb: False.
I am working on fine-tuning. I will keep this thread updated.
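To double-check which CUDA build of PyTorch is installed and what compute capability the local GPU reports, a minimal sketch:

```python
import torch

# The CUDA version PyTorch was compiled against and the GPU's compute
# capability, e.g. (8, 0) for an A100 (sm80) or (8, 6) for an RTX 3090 (sm86).
print("torch:", torch.__version__)
print("compiled CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))
```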
I can run Vicuna-7B on my Windows computer with an RTX 3090 and the performance is good. I am trying to fine-tune it with my own data, but I encountered the following error: RuntimeError: Distributed package doesn't have NCCL built in.
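For what it's worth, the Windows builds of PyTorch ship without NCCL, so distributed initialization there usually has to fall back to the gloo backend; a minimal single-process sketch (whether FastChat's launcher picks this up depends on the setup):

```python
import os
import torch.distributed as dist

# NCCL is not compiled into the Windows wheels of PyTorch; gloo is the usual
# fallback backend there. Single-node, single-process initialization:
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
if not dist.is_initialized():
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
```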
https://github.com/lm-sys/FastChat/issues/294 https://github.com/johnsmith0031/alpaca_lora_4bit/issues/62 https://github.com/pytorch/pytorch/issues/94883
It looks like an A100/H100 GPU is expected.
Did you solve this problem?
Any solution for this error? Expected is_sm80 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
But when I turned off gradient checkpointing (gradient_checkpointing = False) in the Trainer, a new error appeared: AssertionError: use_cache is not supported
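For reference, the usual pairing in transformers is to disable the KV cache whenever gradient checkpointing is on, since the cache only matters at inference time; a minimal sketch (the checkpoint name is illustrative):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")  # illustrative checkpoint

# use_cache and gradient checkpointing are mutually exclusive: disable the
# cache for training to avoid the "use_cache is not supported" assertion.
model.config.use_cache = False
model.gradient_checkpointing_enable()
```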
I'm facing a similar issue:
Expected q_dtype == torch::kFloat16 || ((is_sm8x || is_sm90) && q_dtype == torch::kBFloat16) to be true, but got false
@Ejafa were you able to fine-tune the model on a local GPU?
@eeric did you solve this error? My PyTorch version is 1.13.1+cu117, and I got the same error.
I got the same error as @dittops. I found it is an error in flash-attention. For now, I have to comment out replace_llama_attn_with_flash_attn() to train with LoRA.
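For reference, the monkey patch that comment refers to is applied near the top of FastChat's training scripts; a sketch of what skipping it looks like (module path as of the FastChat version discussed in this thread):

```python
# In fastchat/train/train_mem.py (or the LoRA training script), the flash-attn
# monkey patch runs before training starts. Commenting it out falls back to the
# standard Hugging Face LLaMA attention, which works on GPUs flash-attn rejects.
from fastchat.train.llama_flash_attn_monkey_patch import (
    replace_llama_attn_with_flash_attn,
)

# replace_llama_attn_with_flash_attn()  # disabled: flash-attn fails on this GPU/dtype
```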
RuntimeError: Expected q_dtype == torch::kFloat16 || ((is_sm8x || is_sm90) && q_dtype == torch::kBFloat16) to be true, but got false. (Could this error message be improved? If so,
please report an enhancement request to PyTorch.)
My environment is PyTorch 2.0.0, CUDA 11.7, A100 80G.
Same error with CUDA 11.6 on an A100 80G as well.
I'm looking into it. It seems to be an issue with flash-attention and not Vicuna/FastChat as such.
If I replace bf16 True with fp16 True in the script args and also add "fp16": {"enabled": true} to my deepspeed config, the error changes to RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn, and the relevant part of the traceback is:
─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /root/FastChat/fastchat/train/train_lora_llama.py:163 in <module> │
│ │
│ 160 │
│ 161 │
│ 162 if __name__ == "__main__": │
│ ❱ 163 │ train() │
│ 164 │
│ │
│ /root/FastChat/fastchat/train/train_lora_llama.py:153 in train │
│ │
│ 150 │ if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")): │
│ 151 │ │ trainer.train(resume_from_checkpoint=True) │
│ 152 │ else: │
│ ❱ 153 │ │ trainer.train() │
│ 154 │ trainer.save_state() │
│ 155 │ │
│ 156 │ # Save states. Weights might be a placeholder in zero3 and need a gather │
│ │
│ /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/transformers/trainer.py:1662 in train │
│ │
│ 1659 │ │ inner_training_loop = find_executable_batch_size( │
│ 1660 │ │ │ self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size │
│ 1661 │ │ ) │
│ ❱ 1662 │ │ return inner_training_loop( │
│ 1663 │ │ │ args=args, │
│ 1664 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1665 │ │ │ trial=trial, │
│ │
│ /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/transformers/trainer.py:1929 in │
│ _inner_training_loop │
│ │
│ 1926 │ │ │ │ │ with model.no_sync(): │
│ 1927 │ │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1928 │ │ │ │ else: │
│ ❱ 1929 │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1930 │ │ │ │ │
│ 1931 │ │ │ │ if ( │
│ 1932 │ │ │ │ │ args.logging_nan_inf_filter │
│ │
│ /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/transformers/trainer.py:2715 in │
│ training_step │
│ │
│ 2712 │ │ │ │ scaled_loss.backward() │
│ 2713 │ │ elif self.deepspeed: │
│ 2714 │ │ │ # loss gets scaled under gradient_accumulation_steps in deepspeed │
│ ❱ 2715 │ │ │ loss = self.deepspeed.backward(loss) │
│ 2716 │ │ else: │
│ 2717 │ │ │ loss.backward() │
│ 2718 │ │
│ /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/deepspeed/utils/nvtx.py:15 in │
│ wrapped_fn │
│ │
│ 12 │ │
│ 13 │ def wrapped_fn(*args, **kwargs): │
│ 14 │ │ get_accelerator().range_push(func.__qualname__) │
│ ❱ 15 │ │ ret_val = func(*args, **kwargs) │
│ 16 │ │ get_accelerator().range_pop() │
│ 17 │ │ return ret_val │
│ 18 │
│ │
│ /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/deepspeed/runtime/engine.py:1796 in │
│ backward │
│ │
│ 1793 │ │ │
│ 1794 │ │ if self.zero_optimization(): │
│ 1795 │ │ │ self.optimizer.is_gradient_accumulation_boundary = self.is_gradient_accumula │
│ ❱ 1796 │ │ │ self.optimizer.backward(loss, retain_graph=retain_graph) │
│ 1797 │ │ elif self.amp_enabled(): │
│ 1798 │ │ │ # AMP requires delaying unscale when inside gradient accumulation boundaries │
│ 1799 │ │ │ # https://nvidia.github.io/apex/advanced.html#gradient-accumulation-across-i │
│ │
│ /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2. │
│ py:1890 in backward │
│ │
│ 1887 │ │ │ scaled_loss = self.external_loss_scale * loss │
│ 1888 │ │ │ scaled_loss.backward() │
│ 1889 │ │ else: │
│ ❱ 1890 │ │ │ self.loss_scaler.backward(loss.float(), retain_graph=retain_graph) │
│ 1891 │ │
│ 1892 │ def check_overflow(self, partition_gradients=True): │
│ 1893 │ │ self._check_overflow(partition_gradients) │
│ │
│ /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py │
│ :62 in backward │
│ │
│ 59 │ │
│ 60 │ def backward(self, loss, retain_graph=False): │
│ 61 │ │ scaled_loss = loss * self.loss_scale │
│ ❱ 62 │ │ scaled_loss.backward(retain_graph=retain_graph) │
│ 63 │ │ # print(f'LossScalerBackward: {scaled_loss=}') │
│ 64 │
│ 65 │
│ │
│
│ /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/torch/_tensor.py:487 in backward │
│ │
│ 484 │ │ │ │ create_graph=create_graph, │
│ 485 │ │ │ │ inputs=inputs, │
│ 486 │ │ │ ) │
│ ❱ 487 │ │ torch.autograd.backward( │
│ 488 │ │ │ self, gradient, retain_graph, create_graph, inputs=inputs │
│ 489 │ │ ) │
│ 490 │
│ │
│ /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/torch/autograd/__init__.py:200 in │
│ backward │
│ │
│ 197 │ # The reason we repeat same the comment below is that │
│ 198 │ # some Python versions print out the first line of a multi-line function │
│ 199 │ # calls in the traceback and some print out the last line │
│ ❱ 200 │ Variable._execution_engine.run_backward( # Calls into the C++ engine to run the bac │
│ 201 │ │ tensors, grad_tensors_, retain_graph, create_graph, inputs, │
│ 202 │ │ allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to ru │
│ 203 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
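One workaround that is often suggested for this particular error when combining LoRA with gradient checkpointing is to keep the inputs attached to the autograd graph before wrapping the model with PEFT; a hedged sketch (whether it applies to this FastChat/DeepSpeed setup is not confirmed here, and the checkpoint name is illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")  # illustrative checkpoint

# With the base weights frozen by LoRA, gradient checkpointing can produce a
# loss with no grad_fn; requiring grads on the inputs keeps the graph attached.
model.enable_input_require_grads()
model.gradient_checkpointing_enable()

model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16))
```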
@StrangeTcy did you find any solution?
Bad news! Confirmed by the flash_attn author: it is impossible to train LLaMA or Vicuna using flash_attn on an RTX 3090 (sm86). Here is the link to the flash_attn author's reply: https://github.com/HazyResearch/flash-attention/issues/190#issuecomment-1523375324
I'm facing a similar issue:
Expected q_dtype == torch::kFloat16 || ((is_sm8x || is_sm90) && q_dtype == torch::kBFloat16) to be true, but got false
@dittops @zhengzangw Did you solve the problem using train_lora with flash_attn?
I successfully resolved the error and completed the training, but the result seems incorrect.
I'm facing a similar issue:
Expected q_dtype == torch::kFloat16 || ((is_sm8x || is_sm90) && q_dtype == torch::kBFloat16) to be true, but got false
Here are the details:
The error occurred in lib/python3.10/site-packages/flash_attn/flash_attn_interface.py:21:
softmax_lse, rng_state, *rest = flash_attn_cuda.fwd(
q, k, v, out, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k, dropout_p,
softmax_scale, False, causal, return_softmax, num_splits, generator
)
where flash_attn_cuda appears to be a compiled .so file whose contents cannot be inspected.
After debugging, q.dtype, k.dtype and v.dtype are all torch.float32.
The error message shows that q.dtype must be torch.float16, or torch.bfloat16 if the compute capability is sm8x or sm90. (Note: the compute capability of the A100 is sm80 and that of 40-series GPUs is sm89, as listed on NVIDIA's site.)
Hence, it seems the error occurred because flash-attn does not support tf32.
In the train.sh script, I've set --fp16 True, --fp16_full_eval True and --half_precision_backend "cuda_amp", expecting the model and inputs to be converted to float16 automatically, but it does not work. (Are there any experts here who can provide guidance on how to set this up through the Hugging Face args?)
To solve this manually, I added the following line in site-packages/transformers/trainer.py:
Line 2695: >>> model = model.half()
Another error then occurred because an intermediate variable became tf32, which does not match the model (fp16). I made the corresponding modification (but I forgot where I made the change).
Then the model can be trained successfully!
(I used the deepspeed command with train_lora.py, disabled FSDP, and enabled flash-attn.)
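For reference, a less invasive way to get the same half-precision weights (without patching trainer.py) is to load the model directly in fp16 in the training script; a minimal sketch, with the checkpoint name as an assumption (whether it plays well with the DeepSpeed fp16 loss scaler depends on the config):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the weights in float16 so flash-attn receives fp16 q/k/v tensors
# instead of the float32/tf32 tensors it rejects.
model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.5",      # illustrative checkpoint path
    torch_dtype=torch.float16,
)
```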
Seemingly incorrect result
After training, the train loss seems far too large:
wandb: Run summary:
wandb: train/epoch 2.8
wandb: train/global_step 39
wandb: train/learning_rate 0.0
wandb: train/loss 33.4141
wandb: train/total_flos 1165735527186432.0
wandb: train/train_loss 33.22997
wandb: train/train_runtime 925.8567
wandb: train/train_samples_per_second 2.887
wandb: train/train_steps_per_second 0.042
At the same time, the weights in the output folder seem incomplete:
There are only three files: adapter_config.json, adapter_model.bin and trainer_state.json, and adapter_model.bin is only 8M.
Compared to the outputs of train_mem.py, which are about 26G of files and weights, this result does not seem correct.
So, can anyone tell me whether an output of just these three files is correct? If so, how should I use the adapter_model.bin weights?
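On the last question: a small adapter_model.bin is expected for LoRA, since only the low-rank adapter weights are saved rather than the full 7B model. A minimal sketch of loading the adapter back with PEFT, assuming the adapter folder is the training output_dir (paths and checkpoint name are illustrative):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter saved by train_lora.py.
base = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.5", torch_dtype=torch.float16  # illustrative base checkpoint
)
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")

# output_dir contains adapter_config.json and adapter_model.bin.
model = PeftModel.from_pretrained(base, "path/to/output_dir")  # hypothetical path
model = model.merge_and_unload()  # optional: fold the adapter into the base weights
```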
I'm closing this issue because this seems to be a flash attention issue.
We'll soon migrate to xformers (https://github.com/facebookresearch/xformers) in place of flash-attention. Our internal tests show they have similar memory/compute performance, but xformers is much more stable, is maintained by Meta, supports more types of GPUs, and is more extensible.
@zhisbug If the official setup can run with flash-attention on an A100 (the same hardware), maybe providing an environment spec with the particular versions would help others resolve this confusion.
I successfully resolved the error and completed the training, but the result seems incorrect. [...]
Were you able to solve this issue? I'm also facing the same problem where my loss is very high, which seems incorrect.
LLaMA 7B uses 128 dimensions per head, while flash-attn only supports a head dimension of 64 on the RTX 3090. So LLaMA 7B with flash-attn may only run on an A100 or H100.
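The 128 comes directly from the model config (hidden size divided by number of attention heads); a quick check, with the checkpoint name as an assumption:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("lmsys/vicuna-7b-v1.5")  # illustrative 7B checkpoint
print(cfg.hidden_size // cfg.num_attention_heads)          # 4096 // 32 = 128
```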
Are you sure? flash-attn v2 supports head dims up to 256. I am able to use it on a 3090.
FlashAttention-2 currently supports:
- Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100). Support for Turing GPUs (T4, RTX 2080) is coming soon, please use FlashAttention 1.x for Turing GPUs for now.
- Datatype fp16 and bf16 (bf16 requires Ampere, Ada, or Hopper GPUs).
- All head dimensions up to 256. Head dim > 192 backward requires A100/A800 or H100/H800.
Try #2126
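To check locally which flash-attn generation is installed and what compute capability the GPU reports, something like the following works (a sketch; the version parsing is only a rough heuristic):

```python
import torch
import flash_attn

# FlashAttention-2 supports Ampere/Ada/Hopper GPUs and head dims up to 256,
# whereas v1.x on sm86 (e.g. RTX 3090) has stricter head-dim limits.
major, minor = torch.cuda.get_device_capability(0)
print("flash_attn version:", flash_attn.__version__)
print(f"compute capability: sm{major}{minor}")
print("flash-attn v2 installed:", int(flash_attn.__version__.split(".")[0]) >= 2)
```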
Are you sure? flash-attn v2 supports head dims up to 256. I am able to use it on a 3090. [...]
Yes, thank you. I formerly used FlashAttention 1.x; 2.0 now supports it.