
Training with fp16 precision gives nan in Longt5

Leonard907 opened this issue 2 years ago • 13 comments

System Info

  • transformers version: 4.10.0.dev0
  • Platform: Linux-3.10.0-1160.62.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.8.13
  • PyTorch version (GPU?): 1.9.0+cu111 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help?

@patrickvonplaten

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

I'm currently running the scrolls_benchmark. I'm interested in the performance of the LongT5 model on SCROLLS, so I changed the model name to google/long-t5-tglobal-base and ran training with fp16 enabled (if I run with fp32, I get CUDA OOM errors). However, the output loss is always nan. I googled for fixes and found this post: t5-fp16-fixed. I searched the transformers repo and found that the modeling_longt5 file doesn't seem to incorporate the clamp_value change. Could this be the reason fp16 is not working in LongT5? And if so, could it be fixed with an approach similar to what you did for T5? Thank you very much!

fyi: You probably noticed that the transformers version is 4.10.0, which does not have LongT5. I manually added the LongT5 files in a forked scrolls repo here: longt5_folder. It indeed works properly under a small parameter setting.
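For a rough idea of the failure mode, here is a minimal sketch (not my exact SCROLLS training setup; it assumes a transformers version that ships LongT5 and a GPU for the fp16 half):

```python
import torch
from transformers import AutoTokenizer, LongT5ForConditionalGeneration

model_name = "google/long-t5-tglobal-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LongT5ForConditionalGeneration.from_pretrained(model_name)

inputs = tokenizer(
    "summarize: " + "some long document text " * 400,
    return_tensors="pt", truncation=True, max_length=4096,
)
labels = tokenizer("a short summary", return_tensors="pt").input_ids

# float32 reference: the loss comes out finite
print("fp32 loss:", model(**inputs, labels=labels).loss.item())

# float16 on GPU: in my runs the loss comes out as nan
model_fp16 = model.half().cuda()
fp16_loss = model_fp16(
    **{k: v.cuda() for k, v in inputs.items()}, labels=labels.cuda()
).loss
print("fp16 loss:", fp16_loss.item())
```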

Expected behavior

LongT5 model not producing nan loss with fp16

Leonard907 avatar Jul 01 '22 01:07 Leonard907

This is because -1e10 is used as the attention mask value. A fix similar to what was done in #17306 should work. Would you like to open a PR for LongT5, @Leonard907?
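For illustration, the kind of change being suggested looks roughly like this (a hedged sketch of the general pattern, not the exact diff from #17306 nor the actual LongT5 code; `mask_attention_scores` is a made-up helper):

```python
import torch

def mask_attention_scores(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # A hard-coded -1e10 is not representable in float16 (max magnitude ~65504),
    # so once cast down it becomes -inf, and -inf in the scores can turn into nan
    # after the softmax (e.g. for fully masked rows). Using the dtype's own
    # minimum keeps the bias finite in every precision.
    neg_bias = torch.finfo(scores.dtype).min
    return scores + (1.0 - mask) * neg_bias

scores = torch.randn(1, 8, 16, 16).half()
mask = torch.ones(1, 1, 16, 16, dtype=torch.float16)
mask[..., 8:] = 0  # mask out the second half of the key positions
probs = mask_attention_scores(scores, mask).float().softmax(dim=-1)
print(probs.isnan().any())  # tensor(False)
```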

ydshieh avatar Jul 01 '22 10:07 ydshieh

@ydshieh Sorry for the late reply. I ran a few experiments and found that fixing both the clamp value and the attention mask seems to work (I should mention that at first I only fixed the attention mask and it still gave nan; after I added the clamp value there was no nan). In more detail, there are lines like longt5 704 that need to be set to torch.finfo(dtype).min. For the clamp value part, my changes are identical to modeling_t5.py, since I found the forward function to be similar. I did not read the code fully, so my modifications are hacky (i.e. using the float16 attribute directly). Shall I publish a branch and open a PR for further review? This is my first time doing this, so if there is a guide on how to publish changes and open a PR I would appreciate it. Thanks!
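For reference, the pattern I copied is roughly the one below (a standalone sketch of the clamp from modeling_t5.py, wrapped in a hypothetical helper for illustration, not my exact edits to the LongT5 file):

```python
import torch

def clamp_fp16_overflow(hidden_states: torch.Tensor) -> torch.Tensor:
    """Clamp inf values so fp16 training does not turn them into nan.

    This mirrors the clamp applied after the attention / feed-forward blocks
    in modeling_t5.py; here it is a standalone function only to show the idea.
    """
    if hidden_states.dtype == torch.float16 and torch.isinf(hidden_states).any():
        clamp_value = torch.finfo(hidden_states.dtype).max - 1000
        hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
    return hidden_states
```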

Leonard907 avatar Jul 02 '22 04:07 Leonard907

I also had NaN loss with google/mt5-small on PyTorch.

ZJaume avatar Jul 04 '22 10:07 ZJaume

@Leonard907 @ZJaume

Could you provide code snippets (a training script) that reproduce this issue? I am a bit surprised that torch.finfo(dtype).min is not enough to avoid it, but what has been done in T5 seems reasonable!

ydshieh avatar Jul 04 '22 12:07 ydshieh

In general, T5 just doesn't work well with fp16 since it was trained with bfloat16, and it seems like the model requires quite large values that are not supported by fp16. See: https://github.com/huggingface/transformers/issues/14189

Note that LongT5, MT5, T5, and ByT5 all use more or less the same architecture.
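To make the range difference concrete, here is a quick comparison of the floating-point formats (plain PyTorch, nothing model-specific):

```python
import torch

for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15}  max={info.max:.3e}  smallest normal={info.tiny:.3e}")

# float16 tops out around 6.550e+04, while bfloat16 reaches roughly 3.4e+38
# (about the same range as float32). Activations that were fine during
# bfloat16 pretraining can therefore overflow to inf, and then nan, when the
# same model is run in float16.
```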

patrickvonplaten avatar Jul 04 '22 12:07 patrickvonplaten

@ydshieh It is a bit hard to provide a training script, as I'm experimenting on the SCROLLS dataset: there are a lot of setups and I also made some modifications myself. One could probably reference this discussion stancld-longt5 and use that script with bf16 changed to fp16, but I haven't given it a try.

@patrickvonplaten Thank you for pointing to the other issues about this. I have trained the model with the clamp_value and attention_mask fixes and the results are poor. Without fp16 I can actually get better results, even with fewer parameters. It's just a guess and I need to experiment more, but I thought it might be worth knowing.

FYI: How can I open a PR? I tried making a branch from main, but when I tried to push it I found that I had no permission to do so. As a workaround, I uploaded my changed file to one of my own repositories here: longt5_fix. My fixes are hardcoded, so they should be changed for general purposes.

Leonard907 avatar Jul 05 '22 05:07 Leonard907

Please see https://huggingface.co/docs/transformers/contributing#start-contributing-pull-requests

(in short: you should create a branch, push it to your fork, and open a PR)

ydshieh avatar Jul 05 '22 09:07 ydshieh

I'm actually against the clamp hack here. We've seen multiple times that it's not sufficient to correct the fp16 problem in T5, so I doubt it'll be sufficient here. Moreover, adding clamps at multiple positions in the code slows down the forward pass and makes the code less readable. @patil-suraj @stas00 what do you think here?

patrickvonplaten avatar Jul 07 '22 09:07 patrickvonplaten

Based on my observations, using the clamp_value fix produces much worse results than just using fp32 under the same configuration. And following up on the comment from @patrickvonplaten, I realized that this fix also makes training take more time (I need around 1 hr to generate predictions on the test set with this fix, while with fp32 I only need around 20-30 min).

Leonard907 avatar Jul 07 '22 10:07 Leonard907

(not promoting clamp_value here) I actually think that if we want to add clamp_value, it should be done only when the model is in training mode (if we do so, this also solves the slowdown issue).

(in inference/eval mode, large values almost always imply something is wrong with the model)
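A rough sketch of what a training-only clamp could look like (the `ClampedFF` module here is hypothetical, just to show the gating on `self.training`; it is not the actual LongT5 block):

```python
import torch
from torch import nn

class ClampedFF(nn.Module):
    """Toy feed-forward block with a clamp applied only in training mode."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.wi = nn.Linear(d_model, d_ff, bias=False)
        self.wo = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states = self.wo(torch.relu(self.wi(hidden_states)))
        # Only pay for the isinf check / clamp during fp16 training;
        # in eval mode large values are left alone so problems stay visible.
        if self.training and hidden_states.dtype == torch.float16:
            if torch.isinf(hidden_states).any():
                clamp_value = torch.finfo(hidden_states.dtype).max - 1000
                hidden_states = hidden_states.clamp(-clamp_value, clamp_value)
        return hidden_states
```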

ydshieh avatar Jul 07 '22 10:07 ydshieh

Indeed, the clamping works around the overflow, but it's not really a good mathematical approach. I suppose the model learns to deal with this type of penalty, but it doesn't make training smooth.

  1. The OP in https://github.com/huggingface/transformers/pull/10956 suggests penalizing large activations, as was done in the original code. It's applied more consistently than clamping and is probably easier for the model to learn (a rough sketch of the idea is below).
  2. Then a group of people developed a weight-scaling approach (in addition to the normal mixed-precision scaling): https://github.com/huggingface/transformers/pull/10956#issuecomment-961030728

I also proposed removing clamping in t5: https://github.com/huggingface/transformers/pull/10956#issuecomment-951139733
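For option 1, a hypothetical sketch of the idea (not the exact proposal from #10956): add a small auxiliary penalty on the size of the logits, similar in spirit to the z-loss term used in the original T5 codebase, so activations don't drift toward the fp16 limit during training. The helper name and the 1e-4 coefficient are illustrative.

```python
import torch

def loss_with_z_penalty(logits: torch.Tensor, labels: torch.Tensor,
                        z_loss_coef: float = 1e-4) -> torch.Tensor:
    """Cross-entropy plus a small penalty on log(sum(exp(logits))).

    The penalty discourages the logits from growing without bound, which is
    the "penalize large activations" style of regularizer discussed above.
    """
    ce = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1)
    )
    log_z = torch.logsumexp(logits.float(), dim=-1)  # fp32 for stability
    return ce + z_loss_coef * (log_z ** 2).mean()

logits = torch.randn(2, 5, 32128)          # (batch, seq_len, vocab)
labels = torch.randint(0, 32128, (2, 5))   # int64 targets
print(loss_with_z_penalty(logits, labels))
```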

stas00 avatar Jul 07 '22 17:07 stas00

I have the same problem when training with bf16 set to True

whiteRa2bit avatar Jul 15 '22 07:07 whiteRa2bit

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Aug 08 '22 15:08 github-actions[bot]


It appears this is the case for Mistral as well.

vnica2 avatar Mar 29 '24 07:03 vnica2