Fine-tuning Vicuna-7B with Local GPUs: RuntimeError: Expected is_sm80 || is_sm90 to be true, but got false
RuntimeError: Expected is_sm80 || is_sm90 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
0%| | 0/3096 [00:00<?, ?it/s]
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
I think this is a problem with PyTorch. I solved it by installing PyTorch built with CUDA 11.8 (not 12). If you have CUDA 12, you will also hit problems installing the flash-attention package, since PyTorch is not compiled against CUDA 12.0.
@Ejafa what is your PyTorch version? Mine is 2.0.0+cu117; also, wandb: False.
I am working on fine-tuning. I will keep this thread updated.
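To double-check which CUDA build of PyTorch is installed and what compute capability the local GPU reports, a minimal sketch:

```python
import torch

# The CUDA version PyTorch was compiled against and the GPU's compute
# capability, e.g. (8, 0) for an A100 (sm80) or (8, 6) for an RTX 3090 (sm86).
print("torch:", torch.__version__)
print("compiled CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))
```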
I can run Vicuna-7B on my Windows computer with an RTX 3090 and the performance is good. I am trying to fine-tune it with my own data, but I encountered the following error: RuntimeError: Distributed package doesn't have NCCL built in.
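For what it's worth, the Windows builds of PyTorch ship without NCCL, so distributed initialization there usually has to fall back to the gloo backend; a minimal single-process sketch (whether FastChat's launcher picks this up depends on the setup):

```python
import os
import torch.distributed as dist

# NCCL is not compiled into the Windows wheels of PyTorch; gloo is the usual
# fallback backend there. Single-node, single-process initialization:
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
if not dist.is_initialized():
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
```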
https://github.com/lm-sys/FastChat/issues/294 https://github.com/johnsmith0031/alpaca_lora_4bit/issues/62 https://github.com/pytorch/pytorch/issues/94883
It looks like an A100/H100 GPU is expected.
Did you solve this problem?
Any solution for this error? Expected is_sm80 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
But when I turned off gradient checkpointing (gradient_checkpointing = False) in the Trainer, a new error appeared: AssertionError: use_cache is not supported
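For reference, the usual pairing in transformers is to disable the KV cache whenever gradient checkpointing is on, since the cache only matters at inference time; a minimal sketch (the checkpoint name is illustrative):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")  # illustrative checkpoint

# use_cache and gradient checkpointing are mutually exclusive: disable the
# cache for training to avoid the "use_cache is not supported" assertion.
model.config.use_cache = False
model.gradient_checkpointing_enable()
```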
I'm facing a similar issue:
Expected q_dtype == torch::kFloat16 || ((is_sm8x || is_sm90) && q_dtype == torch::kBFloat16) to be true, but got false
@Ejafa were you able to fine-tune the model on a local GPU?
@eeric did you solve this error? My PyTorch version is 1.13.1+cu117, and I got the same error.
I got the same error as @dittops. I found it is an error in flash-attention. For now, I have to comment out replace_llama_attn_with_flash_attn() to train with LoRA.
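For reference, the monkey patch that comment refers to is applied near the top of FastChat's training scripts; a sketch of what skipping it looks like (module path as of the FastChat version discussed in this thread):

```python
# In fastchat/train/train_mem.py (or the LoRA training script), the flash-attn
# monkey patch runs before training starts. Commenting it out falls back to the
# standard Hugging Face LLaMA attention, which works on GPUs flash-attn rejects.
from fastchat.train.llama_flash_attn_monkey_patch import (
    replace_llama_attn_with_flash_attn,
)

# replace_llama_attn_with_flash_attn()  # disabled: flash-attn fails on this GPU/dtype
```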
RuntimeError: Expected q_dtype == torch::kFloat16 || ((is_sm8x || is_sm90) && q_dtype == torch::kBFloat16) to be true, but got false. (Could this error message be improved? If so,
please report an enhancement request to PyTorch.)
My environment is PyTorch 2.0.0, CUDA 11.7, A100 80G.
Same error with CUDA 11.6 on an A100 80G as well.
I'm looking into it. It seems to be an issue with flash-attention and not Vicuna/FastChat as such.
If I replace bf16 True with fp16 True in the script args and also add "fp16": {"enabled": true} to my deepspeed config, the error changes to RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn, and the relevant part of the traceback is:
─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /root/FastChat/fastchat/train/train_lora_llama.py:163 in <module> │
│ │
│ 160 │
│ 161 │
│ 162 if __name__ == "__main__": │
│ ❱ 163 │ train() │
│ 164 │
│ │
│ /root/FastChat/fastchat/train/train_lora_llama.py:153 in train │
│ │
│ 150 │ if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")): │
│ 151 │ │ trainer.train(resume_from_checkpoint=True) │
│ 152 │ else: │
│ ❱ 153 │ │ trainer.train() │
│ 154 │ trainer.save_state() │
│ 155 │ │
│ 156 │ # Save states. Weights might be a placeholder in zero3 and need a gather │
│ │
│ /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/transformers/trainer.py:1662 in train │
│ │
│ 1659 │ │ inner_training_loop = find_executable_batch_size( │
│ 1660 │ │ │ self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size │
│ 1661 │ │ ) │
│ ❱ 1662 │ │ return inner_training_loop( │
│ 1663 │ │ │ args=args, │
│ 1664 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1665 │ │ │ trial=trial, │
│ │
│ /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/transformers/trainer.py:1929 in │
│ _inner_training_loop │
│ │
│ 1926 │ │ │ │ │ with model.no_sync(): │
│ 1927 │ │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1928 │ │ │ │ else: │
│ ❱ 1929 │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1930 │ │ │ │ │
│ 1931 │ │ │ │ if ( │
│ 1932 │ │ │ │ │ args.logging_nan_inf_filter │
│ │
│ /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/transformers/trainer.py:2715 in │
│ training_step │
│ │
│ 2712 │ │ │ │ scaled_loss.backward() │
│ 2713 │ │ elif self.deepspeed: │
│ 2714 │ │ │ # loss gets scaled under gradient_accumulation_steps in deepspeed │
│ ❱ 2715 │ │ │ loss = self.deepspeed.backward(loss) │
│ 2716 │ │ else: │
│ 2717 │ │ │ loss.backward() │
│ 2718 │ │
│ /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/deepspeed/utils/nvtx.py:15 in │
│ wrapped_fn │
│ │
│ 12 │ │
│ 13 │ def wrapped_fn(*args, **kwargs): │
│ 14 │ │ get_accelerator().range_push(func.__qualname__) │
│ ❱ 15 │ │ ret_val = func(*args, **kwargs) │
│ 16 │ │ get_accelerator().range_pop() │
│ 17 │ │ return ret_val │
│ 18 │
│ │
│ /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/deepspeed/runtime/engine.py:1796 in │
│ backward │
│ │
│ 1793 │ │ │
│ 1794 │ │ if self.zero_optimization(): │
│ 1795 │ │ │ self.optimizer.is_gradient_accumulation_boundary = self.is_gradient_accumula │
│ ❱ 1796 │ │ │ self.optimizer.backward(loss, retain_graph=retain_graph) │
│ 1797 │ │ elif self.amp_enabled(): │
│ 1798 │ │ │ # AMP requires delaying unscale when inside gradient accumulation boundaries │
│ 1799 │ │ │ # https://nvidia.github.io/apex/advanced.html#gradient-accumulation-across-i │
│ │
│ /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2. │
│ py:1890 in backward │
│ │
│ 1887 │ │ │ scaled_loss = self.external_loss_scale * loss │
│ 1888 │ │ │ scaled_loss.backward() │
│ 1889 │ │ else: │
│ ❱ 1890 │ │ │ self.loss_scaler.backward(loss.float(), retain_graph=retain_graph) │
│ 1891 │ │
│ 1892 │ def check_overflow(self, partition_gradients=True): │
│ 1893 │ │ self._check_overflow(partition_gradients) │
│ │
│ /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py │
│ :62 in backward │
│ │
│ 59 │ │
│ 60 │ def backward(self, loss, retain_graph=False): │
│ 61 │ │ scaled_loss = loss * self.loss_scale │
│ ❱ 62 │ │ scaled_loss.backward(retain_graph=retain_graph) │
│ 63 │ │ # print(f'LossScalerBackward: {scaled_loss=}') │
│ 64 │
│ 65 │
│ │
│
│ /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/torch/_tensor.py:487 in backward │
│ │
│ 484 │ │ │ │ create_graph=create_graph, │
│ 485 │ │ │ │ inputs=inputs, │
│ 486 │ │ │ ) │
│ ❱ 487 │ │ torch.autograd.backward( │
│ 488 │ │ │ self, gradient, retain_graph, create_graph, inputs=inputs │
│ 489 │ │ ) │
│ 490 │
│ │
│ /root/anaconda3/envs/fastchat/lib/python3.10/site-packages/torch/autograd/__init__.py:200 in │
│ backward │
│ │
│ 197 │ # The reason we repeat same the comment below is that │
│ 198 │ # some Python versions print out the first line of a multi-line function │
│ 199 │ # calls in the traceback and some print out the last line │
│ ❱ 200 │ Variable._execution_engine.run_backward( # Calls into the C++ engine to run the bac │
│ 201 │ │ tensors, grad_tensors_, retain_graph, create_graph, inputs, │
│ 202 │ │ allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to ru │
│ 203 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
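One workaround that is often suggested for this particular error when combining LoRA with gradient checkpointing is to keep the inputs attached to the autograd graph before wrapping the model with PEFT; a hedged sketch (whether it applies to this FastChat/DeepSpeed setup is not confirmed here, and the checkpoint name is illustrative):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")  # illustrative checkpoint

# With the base weights frozen by LoRA, gradient checkpointing can produce a
# loss with no grad_fn; requiring grads on the inputs keeps the graph attached.
model.enable_input_require_grads()
model.gradient_checkpointing_enable()

model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16))
```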
@StrangeTcy did you find any solution?
Bad news! Confirmed by the flash_attn author: it is impossible to train LLaMA or Vicuna using flash_attn on an RTX 3090 (sm86). Here is the link to the flash_attn author's reply: https://github.com/HazyResearch/flash-attention/issues/190#issuecomment-1523375324
I'm facing a similar issue:
Expected q_dtype == torch::kFloat16 || ((is_sm8x || is_sm90) && q_dtype == torch::kBFloat16) to be true, but got false
@dittops @zhengzangw Did you solve the problem using train_lora with flash_attn?
I successfully resolved the error and completed the training, but the result seems incorrect.
I'm facing a similar issue:
Expected q_dtype == torch::kFloat16 || ((is_sm8x || is_sm90) && q_dtype == torch::kBFloat16) to be true, but got false
Here are the details:
The error occurred in lib/python3.10/site-packages/flash_attn/flash_attn_interface.py:21:
softmax_lse, rng_state, *rest = flash_attn_cuda.fwd(
q, k, v, out, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k, dropout_p,
softmax_scale, False, causal, return_softmax, num_splits, generator
)
where flash_attn_cuda appears to be a compiled .so file whose contents cannot be inspected.
After debugging, q.dtype, k.dtype and v.dtype are all torch.float32.
The error message shows that q.dtype must be torch.float16, or torch.bfloat16 if the compute capability is sm8x or sm90. (Note: the compute capability of the A100 is sm80 and that of 40-series GPUs is sm89, as listed on NVIDIA's site.)
Hence, it seems the error occurred because flash-attn does not support tf32.
In the train.sh script, I've set --fp16 True, --fp16_full_eval True and --half_precision_backend "cuda_amp", expecting the model and inputs to be converted to float16 automatically, but it does not work. (Are there any experts here who can provide guidance on how to set this up through the Hugging Face args?)
To solve this manually, I added the following line in site-packages/transformers/trainer.py:
Line 2695: >>> model = model.half()
Another error then occurred because an intermediate variable became tf32, which does not match the model (fp16). I made the corresponding modification (but I forgot where I made the change).
Then the model can be trained successfully!
(I used the deepspeed command with train_lora.py, disabled FSDP, and enabled flash-attn.)
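For reference, a less invasive way to get the same half-precision weights (without patching trainer.py) is to load the model directly in fp16 in the training script; a minimal sketch, with the checkpoint name as an assumption (whether it plays well with the DeepSpeed fp16 loss scaler depends on the config):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the weights in float16 so flash-attn receives fp16 q/k/v tensors
# instead of the float32/tf32 tensors it rejects.
model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.5",      # illustrative checkpoint path
    torch_dtype=torch.float16,
)
```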
Seemingly incorrect result
After training, the train loss seems far too large:
wandb: Run summary:
wandb: train/epoch 2.8
wandb: train/global_step 39
wandb: train/learning_rate 0.0
wandb: train/loss 33.4141
wandb: train/total_flos 1165735527186432.0
wandb: train/train_loss 33.22997
wandb: train/train_runtime 925.8567
wandb: train/train_samples_per_second 2.887
wandb: train/train_steps_per_second 0.042
At the same time, the weights in the output folder seem incomplete:
There are only three files: adapter_config.json, adapter_model.bin and trainer_state.json, and adapter_model.bin is only 8M.
Compared to the outputs of train_mem.py, which are about 26G of files and weights, this result does not seem correct.
So, can anyone tell me whether an output of just these three files is correct? If so, how should I use the adapter_model.bin weights?
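On the last question: a small adapter_model.bin is expected for LoRA, since only the low-rank adapter weights are saved rather than the full 7B model. A minimal sketch of loading the adapter back with PEFT, assuming the adapter folder is the training output_dir (paths and checkpoint name are illustrative):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter saved by train_lora.py.
base = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.5", torch_dtype=torch.float16  # illustrative base checkpoint
)
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")

# output_dir contains adapter_config.json and adapter_model.bin.
model = PeftModel.from_pretrained(base, "path/to/output_dir")  # hypothetical path
model = model.merge_and_unload()  # optional: fold the adapter into the base weights
```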
I'm closing this issue because this seems to be a flash attention issue.
We'll soon migrate to xformers (https://github.com/facebookresearch/xformers) in place of flash-attention. Our internal tests show they have similar memory/compute performance, but xformers is much more stable, is maintained by Meta, supports more types of GPUs, and is more extensible.
@zhisbug If the official setup can run with flash-attention on an A100 (the same hardware), maybe providing an environment spec with the particular versions would help others resolve this confusion.
I successfully resolved the error and completed the training, but the result seems incorrect. [...]
Were you able to solve this issue? I'm also facing the same problem where my loss is very high, which seems incorrect.
LLaMA 7B uses 128 dimensions per head, while flash-attn only supports a head dimension of 64 on the RTX 3090. So LLaMA 7B with flash-attn may only run on an A100 or H100.
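The 128 comes directly from the model config (hidden size divided by number of attention heads); a quick check, with the checkpoint name as an assumption:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("lmsys/vicuna-7b-v1.5")  # illustrative 7B checkpoint
print(cfg.hidden_size // cfg.num_attention_heads)          # 4096 // 32 = 128
```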
Are you sure? flash-attn v2 supports head dims up to 256. I am able to use it on a 3090.
FlashAttention-2 currently supports:
- Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100). Support for Turing GPUs (T4, RTX 2080) is coming soon, please use FlashAttention 1.x for Turing GPUs for now.
- Datatype fp16 and bf16 (bf16 requires Ampere, Ada, or Hopper GPUs).
- All head dimensions up to 256. Head dim > 192 backward requires A100/A800 or H100/H800.
Try #2126
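To check locally which flash-attn generation is installed and what compute capability the GPU reports, something like the following works (a sketch; the version parsing is only a rough heuristic):

```python
import torch
import flash_attn

# FlashAttention-2 supports Ampere/Ada/Hopper GPUs and head dims up to 256,
# whereas v1.x on sm86 (e.g. RTX 3090) has stricter head-dim limits.
major, minor = torch.cuda.get_device_capability(0)
print("flash_attn version:", flash_attn.__version__)
print(f"compute capability: sm{major}{minor}")
print("flash-attn v2 installed:", int(flash_attn.__version__.split(".")[0]) >= 2)
```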
Are you sure? flash-attn v2 supports head dims up to 256. I am able to use it on a 3090. [...]
Yes, thank you. I formerly used FlashAttention 1.x; 2.0 now supports it.