
Expected is_sm80 to be true, but got false.

Open dstsmallbird opened this issue 1 year ago • 6 comments

Hi there,

I am trying to fine-tune vicuna-7b with two RTX 3090 cards.

torchrun --nnodes=1 --nproc_per_node=2 \
    fastchat/train/train_mem.py \
    --model_name_or_path vicuna-7b \
    --data_path playground/data/alpaca-data-conversation.json \
    --bf16 True \
    --output_dir ./checkpoints \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 100 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True

However, train_mem.py reports "Expected is_sm80 to be true, but got false."

I have also tried nightly PyTorch but got the same error.

Are there any fixes for this issue? Thanks.

BTW: I also tried fastchat/train/train.py and got an OOM error.

dstsmallbird avatar Apr 08 '23 08:04 dstsmallbird

train_mem.py needs an A100/H100, but training with LoRA (see train_lora.py) works fine.
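
For reference, a rough sketch of what such a LoRA launch could look like, reusing the model and data flags from the command above. The --lora_r / --lora_alpha / --lora_dropout flag names are assumptions about train_lora.py's arguments; check the script's argument definitions before running.

# Hypothetical LoRA launch; the lora_* flag names are assumptions,
# verify them against fastchat/train/train_lora.py.
torchrun --nnodes=1 --nproc_per_node=2 \
    fastchat/train/train_lora.py \
    --model_name_or_path vicuna-7b \
    --data_path playground/data/alpaca-data-conversation.json \
    --bf16 True \
    --output_dir ./checkpoints-lora \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-5 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05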

better629 avatar Apr 08 '23 10:04 better629

train_mem.py needs an A100/H100, but training with LoRA (see train_lora.py) works fine.

Thanks. I also tried LoRA and got another error; it seems I used it incorrectly. Could you please point me to some documentation on how to fine-tune vicuna-7b with LoRA? Thanks.

dstsmallbird avatar Apr 08 '23 10:04 dstsmallbird

Encountering the same problem. It seems to be an issue with flash attention, although the sample in the flash-attention repo runs fine. I notice it clears one stage if I remove the bf16 and tf32 flags and add fp16 instead. In any case, you should provide your full trace.
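
Concretely, that change to the launch above would look something like the following. This is only a sketch of the precision-flag swap described here, not a verified fix for the is_sm80 error.

# Same launch as the original command, with --bf16/--tf32 dropped
# and --fp16 used instead (sketch, not a verified fix).
torchrun --nnodes=1 --nproc_per_node=2 \
    fastchat/train/train_mem.py \
    --model_name_or_path vicuna-7b \
    --data_path playground/data/alpaca-data-conversation.json \
    --fp16 True \
    --output_dir ./checkpoints \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 100 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True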

kungfu-eric avatar Apr 09 '23 02:04 kungfu-eric

Encountering the same problem.

xsun15 avatar Apr 09 '23 02:04 xsun15

I think the reason the training sample in flash attention runs correctly is that it's using a fused softmax. We can verify this by adding a print statement to the backward pass – FastChat will trigger the print, but the flash-attention GPT-2 trainer script will not. Unfortunately, the flash-attention trainer script has a bunch of Lightning/Hydra wrappers, so it's not trivial to trace. Further, it's probably not a straight substitution to go from the more vanilla flash attention to the fused ops...

Edit: found the fused ops in the flash-attn GPT model file here: https://github.com/HazyResearch/flash-attention/blob/393882bc089aff64863c260a8633b8390c90aa78/flash_attn/models/gpt.py. Lots of config-gated branches to pull in the fused ops. Theoretically, merging those changes into the llama model file could both make training more efficient and circumvent this is_sm80 error.

Edit 2: Nah, it's a different part of flash attention for the two implementations. The actual problem should be this issue: https://github.com/HazyResearch/flash-attention/issues/138. Basically, the shared memory on the A6000 is too small compared to the A100, so the head dimension needs to be reduced. A hard hardware limitation...
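
For anyone checking which case they are in: is_sm80 refers to compute capability 8.0 (A100), while the RTX 3090 and A6000 report 8.6. A quick way to confirm what your card reports (assumes a working PyTorch CUDA install):

# Prints the compute capability of GPU 0: (8, 0) = sm80 (A100),
# (8, 6) = sm86 (RTX 3090 / A6000), which hits the flash-attn limitation above.
python -c "import torch; print(torch.cuda.get_device_capability(0))"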

kungfu-eric avatar Apr 09 '23 02:04 kungfu-eric

I think the reason the training sample in flash attention runs correctly is that it's using a fused softmax. We can verify this by adding a print statement to the backward pass – FastChat will trigger the print, but the flash-attention GPT-2 trainer script will not. Unfortunately, the flash-attention trainer script has a bunch of Lightning/Hydra wrappers, so it's not trivial to trace. Further, it's probably not a straight substitution to go from the more vanilla flash attention to the fused ops...

Edit: found the fused ops in the flash-attn GPT model file here: https://github.com/HazyResearch/flash-attention/blob/393882bc089aff64863c260a8633b8390c90aa78/flash_attn/models/gpt.py. Lots of config-gated branches to pull in the fused ops. Theoretically, merging those changes into the llama model file could both make training more efficient and circumvent this is_sm80 error.

Edit 2: Nah, it's a different part of flash attention for the two implementations. The actual problem should be HazyResearch/flash-attention#138. Basically, the shared memory on the A6000 is too small compared to the A100, so the head dimension needs to be reduced. A hard hardware limitation...

Thanks for your reply, and sorry for the incomplete trace. I tried running this on an A100 and it works well. I am not sure whether the authors have any plan to fix this (if it can be fixed at all).

dstsmallbird avatar Apr 09 '23 14:04 dstsmallbird

This is a duplicate of #459. It seems that flash-attention has some compatibility issues with commodity GPUs. Closing this one; let's discuss in #459 instead.

zhisbug avatar Apr 21 '23 02:04 zhisbug

train_mem.py needs an A100/H100, but training with LoRA (see train_lora.py) works fine.

Can you please provide the command-line statement you used to train Vicuna with LoRA?

samarthsarin avatar Apr 25 '23 14:04 samarthsarin

Has anyone trained Vicuna with a modified version of train_mem.py to accommodate A6000 GPUs? On the flash-attention repo [1], Tri Dao mentioned that you can use an A6000 with a reduced head dimension, so I'm wondering if anyone has a modified version of FastChat, or knows where the head dimension is specified, so that I can train on A6000s. Thanks!

[1] https://github.com/HazyResearch/flash-attention/issues/138#issuecomment-1466837552 (it seems the A100 requirement will be removed soon, though it's not clear when).
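
If it helps: in the HF LLaMA config the head dimension is not a separate field; it is hidden_size / num_attention_heads, which is 4096 / 32 = 128 for the 7B model, and per the linked issue the flash-attention backward pass with head dim > 64 required an A100/H100 at the time. A quick way to print it for a local checkpoint ('vicuna-7b' here is a placeholder for your local model path):

# Head dimension = hidden_size / num_attention_heads (128 for LLaMA/Vicuna-7B);
# 'vicuna-7b' is a placeholder for your local model path.
python -c "from transformers import AutoConfig; c = AutoConfig.from_pretrained('vicuna-7b'); print(c.hidden_size // c.num_attention_heads)"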

rmovva avatar May 12 '23 19:05 rmovva