SummaryMixing The grad norm is nan

Hi author, I'm getting the following warnings when training Branchformer with summary_mixing:

[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:12,899 (ctc:67) WARNING: 13/34 samples got nan grad. These were ignored for CTC loss.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:13,133 (build_trainer:660) WARNING: The grad norm is nan. Skipping updating the model.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:13,263 (ctc:67) WARNING: 7/32 samples got nan grad. These were ignored for CTC loss.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:13,477 (build_trainer:660) WARNING: The grad norm is nan. Skipping updating the model.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:13,625 (ctc:67) WARNING: 21/45 samples got nan grad. These were ignored for CTC loss.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:13,858 (build_trainer:660) WARNING: The grad norm is nan. Skipping updating the model.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:14,022 (ctc:67) WARNING: 21/62 samples got nan grad. These were ignored for CTC loss.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:14,248 (build_trainer:660) WARNING: The grad norm is nan. Skipping updating the model.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:14,499 (ctc:67) WARNING: 37/105 samples got nan grad. These were ignored for CTC loss.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:14,735 (build_trainer:660) WARNING: The grad norm is nan. Skipping updating the model.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:14,875 (ctc:67) WARNING: 12/39 samples got nan grad. These were ignored for CTC loss.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:15,104 (build_trainer:660) WARNING: The grad norm is nan. Skipping updating the model.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:15,261 (ctc:67) WARNING: 23/56 samples got nan grad. These were ignored for CTC loss.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:15,479 (build_trainer:660) WARNING: The grad norm is nan. Skipping updating the model.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:15,623 (ctc:67) WARNING: 20/47 samples got nan grad. These were ignored for CTC loss.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:15,854 (build_trainer:660) WARNING: The grad norm is nan. Skipping updating the model.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:16,004 (ctc:67) WARNING: 15/53 samples got nan grad. These were ignored for CTC loss.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:16,224 (build_trainer:660) WARNING: The grad norm is nan. Skipping updating the model.

Why is this happening?

Apr 10 '24 09:04 sister-tong

Hello there, we would need much more information about what the model/trainer/data/task is to give you an answer. SummaryMixing does not, in itself, induce more instability during training than MHSA. With more information on the code, we could try to help.

Apr 10 '24 10:04 TParcollet

I printed the output of summary_mixing and the tensor contains NaN values. What could be the reason for this?

Apr 11 '24 05:04 sister-tong

Hi, I am afraid we need much more information to help you here. This could be due to many reasons, most of which are likely not connected to SummaryMixing. Please describe your setup.
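One way to narrow this down before sharing a setup is to check which submodule first produces a non-finite output. The following is only a rough sketch in plain PyTorch: `find_nonfinite_modules` and `InjectNaN` are hypothetical names made up for illustration, the toy model stands in for the real encoder, and only plain tensor outputs are checked (modules returning tuples are skipped).

```python
import torch
import torch.nn as nn


def find_nonfinite_modules(model: nn.Module, *inputs):
    """Run one forward pass and return the names of submodules whose
    output contains NaN or Inf, in execution order (hypothetical helper)."""
    offenders = []
    handles = []

    def make_hook(name):
        def hook(module, args, output):
            # This sketch only inspects plain tensor outputs.
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                offenders.append(name)
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module itself
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(*inputs)
    finally:
        # Always remove the hooks so the model is left untouched.
        for handle in handles:
            handle.remove()
    return offenders


class InjectNaN(nn.Module):
    """Toy layer that deliberately produces NaN, for demonstration only."""

    def forward(self, x):
        return x * float("nan")


# Toy stand-in for an encoder: the second layer is the culprit.
toy = nn.Sequential(nn.Linear(4, 4), InjectNaN(), nn.Linear(4, 2))
offenders = find_nonfinite_modules(toy, torch.randn(3, 4))
print(offenders)
```

The first name in the returned list is the earliest layer whose output went non-finite; everything after it is usually just propagation. For NaNs that only appear in the backward pass, `torch.autograd.set_detect_anomaly(True)` is the closer tool.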

Apr 11 '24 13:04 TParcollet

Hi, when I print the encoder input while using summary_mixing I find NaN in it, but when I use RelPositionMultiHeadedAttention instead the input has no NaN. My environment is listed below; the exact model configuration and the encoder structure are in the zip.

  linux: Ubuntu 20.04.4
  python=3.8.18
  torch=2.0.1
  funasr=0.8.2
  modelscope=1.9.3

code.zip

Apr 12 '24 01:04 sister-tong

Hello,

I've had a quick look at your code, but I am far too unfamiliar with this codebase to make any meaningful comment. My only remark is that we never encountered any NaN issue with SummaryMixing, so it might not be plugged in properly (be careful with the masking, for instance).
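To illustrate the kind of masking issue I mean, here is a minimal sketch of a masked average over time, which is the sort of operation used to build a summary vector from padded batches. This is not the SummaryMixing implementation and I am not claiming it is what happens in your code: `masked_time_mean` and `min_count` are hypothetical names, and the mask convention (True for real frames, False for padding) is an assumption that may be the opposite of what your codebase uses.

```python
import torch


def masked_time_mean(x, mask, min_count=0.0):
    """Average x over time, ignoring padded frames (hypothetical helper).

    x:    (batch, time, feat)
    mask: (batch, time), True for real frames, False for padding (assumed).
    """
    mask = mask.unsqueeze(-1).to(x.dtype)              # (batch, time, 1)
    total = (x * mask).sum(dim=1)                      # sum over valid frames
    count = mask.sum(dim=1).clamp(min=min_count)       # number of valid frames
    return total / count


x = torch.randn(2, 5, 3)
lengths = torch.tensor([5, 3])

# Correct mask: row 0 has 5 valid frames, row 1 has 3.
mask = torch.arange(5)[None, :] < lengths[:, None]
good = masked_time_mean(x, mask)

# A mask that marks every frame of row 1 as padding, e.g. because the
# convention is inverted or the lengths are wrong: 0 / 0 gives NaN.
bad_mask = torch.zeros(2, 5, dtype=torch.bool)
bad_mask[0] = True
bad = masked_time_mean(x, bad_mask)

# Clamping the count avoids the division by zero (the row becomes zeros).
safe = masked_time_mean(x, bad_mask, min_count=1.0)

print(torch.isnan(good).any(), torch.isnan(bad).any(), torch.isnan(safe).any())
```

If the mask passed to the layer follows the opposite convention from the one the layer expects, short utterances in a batch can end up fully masked, and a single NaN row is enough to make the grad norm NaN for the whole batch.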

Apr 12 '24 08:04 TParcollet
