LLaMA-Factory
fix MoD related stuff
What does this PR do?
Everything has been tested with tiny models; it should work out of the box even with bigger models. I've modified the freeze module cycle to include the MoD layer in the choices. Be sure to update the MoD package before testing it.
I still cannot fine-tune the models normally with the latest version of MoD
With gemma-2b, it raises:
File "lib/python3.11/site-packages/torch/utils/checkpoint.py", line 482, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/autograd/function.py", line 553, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/utils/checkpoint.py", line 261, in forward
outputs = run_function(*args)
^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/MoD/MoD.py", line 94, in forward
return (output,cache) if cache_position is not None else (output,)
^^^^^^
UnboundLocalError: cannot access local variable 'cache' where it is not associated with a value
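For context, the UnboundLocalError above follows the classic pattern of a variable that is assigned on only one branch but read unconditionally at the return. A minimal sketch of that shape of bug and its fix, with placeholder names rather than the actual MoD.py source:

```python
import torch

# Hypothetical sketch only, not the real MoD.py forward(): 'cache' is bound
# only when use_cache is True, yet the return reads it whenever
# cache_position is not None, which raises UnboundLocalError.
def forward(hidden_states: torch.Tensor, cache_position=None, use_cache: bool = False):
    cache = None                   # fix: always bind a default before branching
    output = hidden_states         # stand-in for the wrapped block computation
    if use_cache:
        cache = (output, output)   # placeholder for the key/value cache
    return (output, cache) if cache_position is not None else (output,)
```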
With qwen1.5-0.5b, it raises:
File "lib/python3.11/site-packages/torch/utils/checkpoint.py", line 482, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/autograd/function.py", line 553, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/utils/checkpoint.py", line 261, in forward
outputs = run_function(*args)
^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/MoD/MoD.py", line 91, in forward
)[0] * weights[i][selected_mask[i]].unsqueeze(-1)
~~~~~~~~~~^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
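As the message suggests, the quickest way to localize a device-side assert like this is to force synchronous kernel launches, or to replay the failing indexing on CPU, where PyTorch raises a readable IndexError instead. A small sketch of that workflow with made-up shapes (nothing here is the real MoD state):

```python
import os

# Must be set before CUDA is initialized so kernel launches become synchronous
# and the Python stack trace points at the operation that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# Replaying the masked indexing on CPU turns the opaque device-side assert
# into a plain IndexError that names the offending index.
weights = torch.randn(2, 8)                         # e.g. router weights [batch, seq]
selected_mask = torch.zeros(2, 8, dtype=torch.bool)
selected_mask[:, :3] = True                         # e.g. top-k selection mask
print(weights[0][selected_mask[0]].unsqueeze(-1).shape)  # fails loudly on CPU if the mask is malformed
```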
I changed a few lines in MoD and now llama-2 7B can be fine-tuned normally: https://www.diffchecker.com/uqDWakZi/
but the grad norm seems to be unexpectedly large.
(not sure, probably due to the hyperparameters)
Cool, thanks for the check. As for the gradients, I think it is quite normal since we are inserting a new module into a pretrained model.
The qwen error is quite unusual and I think it might be carried over from other parts of the code; I'll investigate in the coming days.
I haven't dived into the implementation details, it's just an on-the-fly fix. I am using transformers 4.40.0.dev0 now.
Thanks, I'll do some tests tomorrow.
Just pushed a new version (1.1.6). I've tested it and the training seems to go well (the gradient stabilizes in about 10-20 steps, provided that max_norm is enabled). Both the Gemma and Qwen issues have been addressed.
I haven't dived into the implementation details, it's just an on-the-fly fix. I am using transformers 4.40.0.dev0 now.
Where did you install 4.40.0.dev0 from? The latest release I can see is only 4.39.3. Does installing it avoid this problem? I'm running Qwen1.5-72B with FSDP+QLoRA and hit the same issue; my transformers version is 4.39.1.
You can install it from source by doing pip install git+https://github.com/huggingface/transformers. Both 4.39 and 4.40 are tested.
Thanks a lot, I'll test it tomorrow and report back whether this was the problem.
pip install git+https://github.com/huggingface/transformers
I've tested it; upgrading transformers to 4.40.0.dev0 still doesn't solve the problem.
...
tqdm 4.66.2
transformers 4.40.0.dev0
triton 2.1.0
...
The same problem still occurs:
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
0%| | 0/2025 [00:06<?, ?it/s]
[2024-04-17 02:45:21,299] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 730 closing signal SIGTERM
[2024-04-17 02:45:51,299] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 730 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-17 02:46:00,688] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 731) of binary: /usr/bin/python3.10
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
Just pushed a new version (1.1.6). I've tested it and the training seems to go well (the gradient stabilizes in about 10-20 steps, provided that max_norm is enabled). Both the Gemma and Qwen issues have been addressed.
Dear friend, could you tell me how you solved the issue with Qwen1.5? What specific steps did you take?
Nothing in particular, it went away with 1.1.6. What hardware are you running it on? There are many possible causes of a CUDA error of this type. Can you please launch a run with CUDA_LAUNCH_BLOCKING=1 and TORCH_DISTRIBUTED_DEBUG=1?
Which component does the 1.1.6 you mentioned refer to? I'm using 2x 3090s, which should be supported architecture-wise.
Could you turn on the button to allow edits? I need to resolve the conflicts.
Btw, I find that there still exist some problems with SdpaAttention or FlashAttention in the MoD implementation
Seems pretty strange, both of them work on my end. What models are you using? What GPU do you use?
Could you turn on the button to allow edits? I need to resolve the conflicts.
Btw, I find that there still exist some problems with SdpaAttention or FlashAttention in the MoD implementation
I don't actually have that option
qwen1.5+flashattn2 on A100 gpu
Awesome, thanks, I'll try to debug it
Could you turn on the button to allow edits? I need to resolve the conflicts.
Btw, I find that there still exist some problems with SdpaAttention or FlashAttention in the MoD implementation
I don't actually have that option
Or could you please rebase this PR onto the latest main branch?
No problem, I'll do it tomorrow. Also, please send the error it gives you; I can't manage to reproduce it.
It should be on par now
Seems that it still has conflicts with the main branch; I think you can open a new PR based on it.
As for the flash attn problem, we can fix it after merging the PR for the MoD feature, since it works for most cases.
ping @mlinmg
No problem, I'll do it tomorrow. Also, please send the error it gives you; I can't manage to reproduce it.
script:
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
--stage sft \
--do_train \
--model_name_or_path qwen1_5-0_5b-chat \
--dataset alpaca_gpt4_en \
--report_to none \
--template default \
--finetuning_type full \
--mixture_of_depths convert \
--output_dir mod/sft \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 1024 \
--preprocessing_num_workers 16 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--optim adamw_8bit \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 100 \
--eval_steps 100 \
--evaluation_strategy steps \
--load_best_model_at_end \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--max_samples 3000 \
--val_size 0.1 \
--plot_loss \
--pure_bf16
traceback:
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [5,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [5,0,0], thread: [1,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
File "src/train_bash.py", line 14, in <module>
main()
File "src/train_bash.py", line 5, in main
run_exp()
File "src/llmtuner/train/tuner.py", line 33, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "src/llmtuner/train/sft/workflow.py", line 71, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/transformers/trainer.py", line 1858, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/transformers/trainer.py", line 2202, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/transformers/trainer.py", line 3137, in training_step
loss = self.compute_loss(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/transformers/trainer.py", line 3160, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1168, in forward
outputs = self.model(
^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1043, in forward
layer_outputs = self._gradient_checkpointing_func(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "src/llmtuner/model/utils.py", line 131, in custom_gradient_checkpointing_func
return gradient_checkpointing_func(func, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/_compile.py", line 24, in inner
return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/utils/checkpoint.py", line 482, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/autograd/function.py", line 553, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/utils/checkpoint.py", line 261, in forward
outputs = run_function(*args)
^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/MoD/MoD.py", line 86, in forward
processed_tokens[i][selected_mask[i]] = self.block(
^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 768, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 283, in forward
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
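The CUBLAS failure at the matmul is most likely a downstream symptom of the index-out-of-bounds assert reported just above it. For illustration only, here is a hypothetical reconstruction of the routing pattern visible in the traceback (select tokens with a mask, run them through the wrapped block, scatter them back weighted by the router), with made-up shapes; if the mask or weights were built for a different sequence length, the scatter is exactly where such an indexing assert would fire:

```python
import torch

batch, seq, dim, capacity = 2, 8, 4, 3
hidden = torch.randn(batch, seq, dim)

# Router scores every token; only the top-`capacity` tokens per sequence
# are routed through the expensive transformer block.
router_weights = torch.randn(batch, seq)
topk = router_weights.topk(capacity, dim=-1).indices
selected_mask = torch.zeros(batch, seq, dtype=torch.bool)
selected_mask[torch.arange(batch).unsqueeze(-1), topk] = True

block = torch.nn.Linear(dim, dim)       # stand-in for the wrapped decoder layer
processed_tokens = hidden.clone()
for i in range(batch):
    out = block(hidden[i][selected_mask[i]])                     # [capacity, dim]
    processed_tokens[i][selected_mask[i]] = (
        out * router_weights[i][selected_mask[i]].unsqueeze(-1)  # re-weight routed tokens
    )
```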
Cool, I'll rebase it now
Too many dummy commits in this PR; opening another one should be more efficient.
Btw, qwen1.5 models do not work for me in either the non-FA2 or FA2 path.
I'm using the TinyLlama-1.1B-Chat-v1.0 model, but the gradient soon becomes NaN.
I think the problem here is that since the MoD router is trying to adapt to a pretrained model, the gradients are very high and the weights change too much. You can try setting the grad norm to something like 0.5/0.8 to see if it helps.
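For example (assuming the stock Hugging Face TrainingArguments interface, which the train_bash.py flags above map onto), lowering max_grad_norm below its default of 1.0 clips the router's large gradients before they move the pretrained weights too far:

```python
# Minimal sketch, not a full training setup; with the CLI script above the
# equivalent would be adding a --max_grad_norm 0.5 flag.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="mod/sft",
    learning_rate=5e-5,
    bf16=True,
    max_grad_norm=0.5,   # clip harder than the default of 1.0
)
```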