LLaMA-Factory

fix MoD related stuff

Open mlinmg opened this issue 2 months ago • 25 comments

What does this PR do?

Everything has been tested with tiny models; it should work out of the box even with bigger models. I've modified the freeze-module loop to include the MoD layer among the choices. Be sure to update the MoD package before testing it.
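For context, mixture-of-depths routes only a fraction of each sequence's tokens through a transformer block, chosen by a small learned router, while the remaining tokens take the residual skip path. A minimal sketch of that idea (class and parameter names here are illustrative, not the MoD package's actual API):

import torch
import torch.nn as nn

class MoDLayerSketch(nn.Module):
    """Illustrative mixture-of-depths wrapper: only the top-k tokens (by router
    score) pass through the wrapped block; the rest skip it via the residual path."""

    def __init__(self, block: nn.Module, hidden_size: int, capacity: float = 0.125):
        super().__init__()
        self.block = block                                 # the original transformer layer
        self.router = nn.Linear(hidden_size, 1, bias=False)
        self.capacity = capacity                           # fraction of tokens routed per sequence

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = hidden_states.shape
        k = max(1, int(seq_len * self.capacity))
        scores = self.router(hidden_states).squeeze(-1)    # (batch, seq_len)
        weights, idx = torch.topk(scores, k, dim=-1)       # positions routed through the block
        output = hidden_states.clone()
        for i in range(batch):
            routed = hidden_states[i, idx[i]].unsqueeze(0)             # (1, k, hidden)
            processed = self.block(routed).squeeze(0)                  # (k, hidden)
            # scale by the (sigmoided) router score so the router receives gradient
            output[i, idx[i]] = processed * torch.sigmoid(weights[i]).unsqueeze(-1)
        return output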

mlinmg avatar Apr 15 '24 19:04 mlinmg

I still cannot fine-tune the models normally with the latest version of MoD

With gemma-2b, it raises:

File "lib/python3.11/site-packages/torch/utils/checkpoint.py", line 482, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/utils/checkpoint.py", line 261, in forward
    outputs = run_function(*args)
              ^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/MoD/MoD.py", line 94, in forward
    return (output,cache) if cache_position is not None else (output,)
            ^^^^^^
UnboundLocalError: cannot access local variable 'cache' where it is not associated with a value
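For reference, the error itself is a plain Python pattern: cache is bound on only one code path but referenced on return. A toy reconstruction of the failure mode (hypothetical, not the MoD source), with the usual fix of binding the variable up front:

import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Toy reconstruction: `cache` is assigned only when use_cache is True, but
    returned whenever cache_position is given. Under gradient checkpointing,
    use_cache is False while a cache_position tensor may still be passed in."""

    def __init__(self, hidden: int = 8):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, hidden_states, cache_position=None, use_cache=False):
        cache = None  # the fix: bind `cache` on every path
        output = self.proj(hidden_states)
        if use_cache:
            cache = (output,)  # stand-in for the real KV cache
        # Without the `cache = None` line above, this return raises
        # UnboundLocalError when cache_position is given and use_cache is False.
        return (output, cache) if cache_position is not None else (output,)

ToyBlock()(torch.randn(1, 4, 8), cache_position=torch.arange(4), use_cache=False)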

With qwen1.5-0.5b, it raises:

File "lib/python3.11/site-packages/torch/utils/checkpoint.py", line 482, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/utils/checkpoint.py", line 261, in forward
    outputs = run_function(*args)
              ^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/MoD/MoD.py", line 91, in forward
    )[0] * weights[i][selected_mask[i]].unsqueeze(-1)
           ~~~~~~~~~~^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
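For what it's worth, a device-side assert coming out of an indexing op usually means an index tensor contains out-of-range values; because CUDA reports the assert asynchronously, the traceback can point at an unrelated line. A generic reproduction of this class of error (not the MoD code itself):

import torch

x = torch.randn(16, 64, device="cuda")
idx = torch.tensor([3, 7, 20], device="cuda")  # 20 is out of bounds for dim 0 (size 16)

# On CPU the same indexing raises a clear IndexError; on CUDA it trips a
# device-side assert inside the kernel, and the failure may only surface at a
# later CUDA call. Running with CUDA_LAUNCH_BLOCKING=1 pins it to the right line.
y = x[idx]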

hiyouga avatar Apr 15 '24 19:04 hiyouga

I changed a few lines in MoD and now llama-2 7B can be fine-tuned normally: https://www.diffchecker.com/uqDWakZi/, but the grad norm seems to be unexpectedly large. [screenshot]

(not sure; probably due to the hyperparameters)

hiyouga avatar Apr 15 '24 19:04 hiyouga

Cool, thanks for checking. As for the gradients, I think it is quite normal since we are inserting a new module into a pretrained model.

mlinmg avatar Apr 15 '24 20:04 mlinmg

The Qwen error is quite unusual, and I think it might be carried over from other parts of the code; I'll investigate in the coming days.

mlinmg avatar Apr 15 '24 20:04 mlinmg

I haven't dived into the implementation details; it's just an on-the-fly fix. I am using transformers 4.40.0.dev0 now.

hiyouga avatar Apr 15 '24 20:04 hiyouga

Thx I'll do some tests tomorrow

mlinmg avatar Apr 15 '24 20:04 mlinmg

Just pushed a new version (1.1.6). I've tested it and the training seems to go well (the gradient norm stabilizes within about 10-20 steps, provided that max_norm is enabled). Both the Gemma and Qwen issues have been addressed.
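If max_norm here refers to gradient clipping, the HF Trainer controls it through max_grad_norm (default 1.0), exposed on the command line as --max_grad_norm; a minimal sketch of the equivalent programmatic setting:

from transformers import TrainingArguments

# Gradient clipping threshold; equivalent to passing --max_grad_norm 1.0 on the
# command line. The output directory is just a placeholder.
args = TrainingArguments(output_dir="out", max_grad_norm=1.0)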

mlinmg avatar Apr 16 '24 14:04 mlinmg

I haven't dived into the implementation details; it's just an on-the-fly fix. I am using transformers 4.40.0.dev0 now.

Where did you install 4.40.0.dev0 from? The latest release I can see is only 4.39.3. Does installing it avoid this problem? I hit the same issue with Qwen1.5-72B under FSDP+QLoRA; my transformers version is 4.39.1.

camposs1979 avatar Apr 16 '24 14:04 camposs1979

I haven't dived into the implementation details; it's just an on-the-fly fix. I am using transformers 4.40.0.dev0 now.

Where did you install 4.40.0.dev0 from? The latest release I can see is only 4.39.3. Does installing it avoid this problem? I hit the same issue with Qwen1.5-72B under FSDP+QLoRA; my transformers version is 4.39.1.

You can install it from source with pip install git+https://github.com/huggingface/transformers. Both 4.39 and 4.40 have been tested.
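A quick way to confirm which version is active after a source install:

import transformers

print(transformers.__version__)  # expect something like 4.40.0.dev0 after installing from source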

mlinmg avatar Apr 16 '24 14:04 mlinmg

Thanks a lot; I'll test it tomorrow and report back whether that was the problem.

camposs1979 avatar Apr 16 '24 14:04 camposs1979

pip install git+https://github.com/huggingface/transformers

I've tested it: upgrading transformers to 4.40.0.dev0 still does not solve the problem.

... tqdm 4.66.2 transformers 4.40.0.dev0 triton 2.1.0 ...

The same error still occurs:

CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

  0%|          | 0/2025 [00:06<?, ?it/s]
[2024-04-17 02:45:21,299] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 730 closing signal SIGTERM
[2024-04-17 02:45:51,299] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 730 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-04-17 02:46:00,688] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 731) of binary: /usr/bin/python3.10
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1044, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 702, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self.entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

camposs1979 avatar Apr 17 '24 02:04 camposs1979

Just pushed a new version (1.1.6). I've tested it and the training seems to go well (the gradient norm stabilizes within about 10-20 steps, provided that max_norm is enabled). Both the Gemma and Qwen issues have been addressed.

Dear friend, could you tell me how you solved the issue with Qwen1.5? What specific steps did you take?

camposs1979 avatar Apr 17 '24 03:04 camposs1979

Nothing in particular; it went away with 1.1.6. What hardware are you running it on? There are many possible causes for a CUDA error of this type. Could you please launch a run with CUDA_LAUNCH_BLOCKING=1 and TORCH_DISTRIBUTED_DEBUG=1?
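As an aside, TORCH_DISTRIBUTED_DEBUG expects OFF, INFO, or DETAIL rather than 1. Exporting both variables in the shell before launching is the usual approach; setting them at the very top of the entry script, before any CUDA work, also works. A minimal sketch:

import os

# Must be set before the first CUDA call / process-group init to take effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"          # synchronous kernel launches -> accurate stack traces
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # valid values: OFF, INFO, DETAIL

import torch  # import torch only after the variables are set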

mlinmg avatar Apr 17 '24 08:04 mlinmg

Which component does 1.1.6 refer to? I'm using 2×3090s, which should be supported architecture-wise.

camposs1979 avatar Apr 17 '24 09:04 camposs1979

Could you turn on the option to allow edits? I need to resolve the conflicts.

[screenshot]

Btw, I find that there are still some problems with SdpaAttention or FlashAttention in the MoD implementation.

hiyouga avatar Apr 17 '24 17:04 hiyouga

Seems pretty strange; both of them work on my end. What models are you using? What GPU do you use?

mlinmg avatar Apr 17 '24 18:04 mlinmg

Could you turn on the option to allow edits? I need to resolve the conflicts.

[screenshot]

Btw, I find that there are still some problems with SdpaAttention or FlashAttention in the MoD implementation.

I don't actually have that option. [screenshot]

mlinmg avatar Apr 17 '24 18:04 mlinmg

Qwen1.5 + FlashAttention-2 on an A100 GPU

hiyouga avatar Apr 17 '24 18:04 hiyouga

Awesome, thanks. I'll try to debug it.

mlinmg avatar Apr 17 '24 18:04 mlinmg

Could you turn on the option to allow edits? I need to resolve the conflicts.

[screenshot]

Btw, I find that there are still some problems with SdpaAttention or FlashAttention in the MoD implementation.

I don't actually have that option. [screenshot]

Or could you please rebase this PR onto the latest main branch?

hiyouga avatar Apr 17 '24 18:04 hiyouga

No problem, I'll do it tomorrow. Also, please send the error it gives you; I can't manage to reproduce it.

mlinmg avatar Apr 17 '24 20:04 mlinmg

It should be in sync with main now.

mlinmg avatar Apr 18 '24 09:04 mlinmg

It seems that it still has conflicts with the main branch; I think you can open a new PR based on it.

hiyouga avatar Apr 18 '24 09:04 hiyouga

As for the flash-attn problem, we can fix it after merging the PR for the MoD feature, since it works for most cases.

hiyouga avatar Apr 18 '24 09:04 hiyouga

ping @mlinmg

hiyouga avatar Apr 18 '24 11:04 hiyouga

No problem, I'll do it tomorrow. Also, please send the error it gives you; I can't manage to reproduce it.

script:

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path qwen1_5-0_5b-chat \
    --dataset alpaca_gpt4_en \
    --report_to none \
    --template default \
    --finetuning_type full \
    --mixture_of_depths convert \
    --output_dir mod/sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --optim adamw_8bit \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --pure_bf16

traceback:

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [5,0,0], thread: [0,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [5,0,0], thread: [1,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
  File "src/train_bash.py", line 14, in <module>
    main()
  File "src/train_bash.py", line 5, in main
    run_exp()
  File "src/llmtuner/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "src/llmtuner/train/sft/workflow.py", line 71, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/transformers/trainer.py", line 1858, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/transformers/trainer.py", line 2202, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/transformers/trainer.py", line 3137, in training_step
    loss = self.compute_loss(model, inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/transformers/trainer.py", line 3160, in compute_loss
    outputs = model(**inputs)
              ^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1168, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1043, in forward
    layer_outputs = self._gradient_checkpointing_func(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/llmtuner/model/utils.py", line 131, in custom_gradient_checkpointing_func
    return gradient_checkpointing_func(func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/utils/checkpoint.py", line 482, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/utils/checkpoint.py", line 261, in forward
    outputs = run_function(*args)
              ^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/MoD/MoD.py", line 86, in forward
    processed_tokens[i][selected_mask[i]] = self.block(
                                            ^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 768, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 283, in forward
    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

hiyouga avatar Apr 18 '24 17:04 hiyouga

Cool, I'll rebase it now.

mlinmg avatar Apr 18 '24 17:04 mlinmg

There are too many dummy commits in this PR; opening another one should be more efficient.

Btw, Qwen1.5 models do not work for me on either the non-FA2 or the FA2 path.

hiyouga avatar Apr 18 '24 17:04 hiyouga