
Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb loss is all NaN

Open z3ugma opened this issue 1 year ago • 25 comments

This notebook (peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb): https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb

Trains fine on Google Colab at https://colab.research.google.com/drive/16XbIysCzgpAld7Kd9-xz-23VPWmqdWmW?usp=sharing#scrollTo=upI97XEH6EKe

using Python 3.10.12, Torch 2.1.0

It does not train on my workstation - the loss collapses to NaN after just a few epochs:

Loss: 6.078125
['a soccer player with his arms up in the air\n', 'mario balotelli celebrates after scoring against juventus\n']
Loss: 3.630859375
Loss: 4.01171875
Epoch: 2
Loss: 4.48046875
['cristiano ronaldo is the most expensive player in the world\n', 'a soccer player with his arms raised in celebration\n']
Loss: 3.25
Loss: 4.2734375
Epoch: 3
Loss: 4.0625
['a bald soccer player with a white shirt and blue shorts\n', "Juventus' Mario Mandzukic celebrates after scoring against Barcelona\n"]
Loss: 3.01953125
Loss: nan
Epoch: 4
Loss: nan

My workstation is on Python 3.10.13 and Torch 2.1.0. What could be causing the loss to go all NaN?
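For anyone debugging this, a small guard like the sketch below (a hypothetical helper, not part of the notebook) aborts training on the first non-finite loss so the offending step and batch can be inspected, instead of letting the run silently continue with NaN:

```python
import math

def check_loss(loss_value: float, step: int) -> float:
    """Raise as soon as the loss goes non-finite so the bad step/batch
    can be inspected (e.g. by re-running that step with
    torch.autograd.set_detect_anomaly(True))."""
    if math.isnan(loss_value) or math.isinf(loss_value):
        raise RuntimeError(f"loss became non-finite ({loss_value}) at step {step}")
    return loss_value

# Hypothetical loss trace shaped like the one above: finite values, then NaN.
trace = [6.078125, 3.630859375, 4.01171875, float("nan")]
for step, loss in enumerate(trace):
    try:
        check_loss(loss, step)
    except RuntimeError as err:
        print(err)  # loss became non-finite (nan) at step 3
        break
```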

z3ugma avatar Dec 04 '23 23:12 z3ugma

@younesbelkada you're the author of that sample notebook and the keeper of the football dataset on Hugging Face - any idea what might be causing the loss to go to nan?

z3ugma avatar Dec 04 '23 23:12 z3ugma

@z3ugma I have the same problem. I started getting a NaN loss in the 2nd batch of epoch 0. Have you solved it?

triangle959 avatar Dec 13 '23 06:12 triangle959

I found an interesting thing: on Google Colab, the loss does not change to NaN. There still seem to be differences between Colab and the local notebook.

triangle959 avatar Dec 13 '23 08:12 triangle959

I also ran into this issue recently with fine-tuning BLIP2, whereas it was working before. I haven't had a chance to pin it down, but it might be a package version issue, with some dependency introducing a breaking change?

jeffliu-LL avatar Dec 14 '23 18:12 jeffliu-LL

Rolling back to peft==0.5.0 got the BLIP2 example working for me
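The rollback can be pinned explicitly; a sketch (the torch pin matches my environment from later in this thread, so adjust it to the wheel that fits your CUDA driver):

```shell
# Pin peft back to 0.5.0; torch 2.0.1 is the build from my environment,
# pick the wheel matching your CUDA driver.
pip install "peft==0.5.0" "torch==2.0.1"
```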

jeffliu-LL avatar Dec 14 '23 19:12 jeffliu-LL

@jeffliu-LL which pytorch version are you using?

AntoniaSch avatar Dec 15 '23 09:12 AntoniaSch

pytorch 2.0.1 with pytorch-cuda 11.8

jeffliu-LL avatar Dec 19 '23 19:12 jeffliu-LL

I will try rolling back to peft 0.5 with cuda 12.2 and Python 3.11.

Will report back

z3ugma avatar Dec 23 '23 19:12 z3ugma

No, still a problem:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.1.2+cu121
Datasets 2.16.0
Python 3.11.7 | packaged by conda-forge | (main, Dec 15 2023, 08:38:37) [GCC 12.3.0]
PEFT 0.5.0

z3ugma avatar Dec 24 '23 04:12 z3ugma

Unfortunately, the loss is still all NaN after downgrading PyTorch and PEFT:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.0.1+cu117
Datasets 2.16.0
Transformers 4.36.2
Python 3.11.7 | packaged by conda-forge | (main, Dec 15 2023, 08:38:37) [GCC 12.3.0]
PEFT 0.5.0

z3ugma avatar Dec 24 '23 04:12 z3ugma

@jeffliu-LL could you post the versions of Python, PyTorch, Transformers, and CUDA from your working environment?

z3ugma avatar Dec 24 '23 04:12 z3ugma

Here are the packages from the working Google Colab environment:

Working: 
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.0.1+cu117
Datasets 2.16.0
Transformers 4.35.2
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
PEFT 0.5.0

z3ugma avatar Dec 24 '23 04:12 z3ugma

Still not working on Python 3.10 either. Version details of another non-working environment:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.0.1+cu117
Datasets 2.16.0
Transformers 4.35.2
Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
PEFT 0.5.0
SciPy 1.11.4
Pillow 9.4.0

z3ugma avatar Dec 24 '23 19:12 z3ugma

Seems like we've hit the same problem :(

OS: Windows 10
CUDA: 11.8
Python 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)] on win32
Torch 2.1.2+cu118
Datasets 2.16.1
Transformers 4.36.2
PEFT 0.7.1
bitsandbytes 0.41.0

bit-bcilab avatar Jan 11 '24 10:01 bit-bcilab


@z3ugma I'm hitting a similar issue. The loss changes to NaN after epoch 0. Have you fixed it?
dataset: jpawan33/kag100-image-captioning-dataset
pytorch: 1.13.0
cuda: 11.3
python: 3.9
PEFT: 0.7.2.dev0
transformers: 4.36.2

wushandinghua avatar Jan 20 '24 14:01 wushandinghua

@wushandinghua no, I've not yet had success

z3ugma avatar Jan 21 '24 18:01 z3ugma

Any solutions? I have the same problem.

pribadihcr avatar Mar 05 '24 15:03 pribadihcr

I changed the type from torch.float16 to torch.float32. This works for me. Hope this will also work for you.
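The cast helps because float16 overflows easily: its largest finite value is about 65504, so any intermediate activation or logit beyond that becomes inf, after which operations like inf - inf produce NaN. A minimal demonstration (using numpy here as an assumption for brevity; torch float16 behaves the same way):

```python
import numpy as np

# float16's max finite value is 65504; anything larger overflows to inf.
x16 = np.float16(70000.0)
print(x16)        # inf
print(x16 - x16)  # nan  (inf - inf is undefined)

# The same value is perfectly representable in float32.
x32 = np.float32(70000.0)
print(x32 - x32)  # 0.0
```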

eddie221 avatar Mar 06 '24 07:03 eddie221

I changed the type from torch.float16 to torch.float32. This works for me. Hope this will also work for you.

I changed this: pixel_values = batch.pop("pixel_values").to(device, torch.float32)

I still have the same problem.

pribadihcr avatar Mar 06 '24 12:03 pribadihcr


Can you send me a copy of the code you use to run this locally? I'd also like to try it on my own computer instead of using Google Colab. Thank you very much!

shams2023 avatar Mar 13 '24 08:03 shams2023


I changed the type from torch.float16 to torch.float32. This works for me. Hope this will also work for you.

change this: pixel_values = batch.pop("pixel_values").to(device, torch.float32)

still has same problem

Sorry for the late reply. I also changed the model dtype from torch.float16 to torch.float32. Two modifications to the code are needed:

  1. model = Blip2ForConditionalGeneration.from_pretrained("ybelkada/blip2-opt-2.7b-fp16-sharded", device_map="auto", load_in_8bit=True, torch_dtype=torch.float32)
  2. pixel_values = batch.pop("pixel_values").to(device, torch.float32)

Here is the notebook with my testing result: https://colab.research.google.com/drive/1j2jey-OqmtUa3IcI1kOcswWAWmiG4JKJ?usp=sharing

eddie221 avatar Mar 13 '24 09:03 eddie221


Can you send me a copy of the code that deploys you locally? I also want to try it on my own computer instead of using Google Colab. Thank you !

I tested with the same code as https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb on my computer.

eddie221 avatar Mar 13 '24 09:03 eddie221

Can you send me a copy of the code you use to run this locally? I'd also like to try it on my own computer instead of using Google Colab. Thank you!

I tested with the same code as https://github.com/huggingface/notebooks/blob/main/peft/Fine_tune_BLIP2_on_an_image_captioning_dataset_PEFT.ipynb on my computer.

Okay, I will set it up in PyCharm on my own machine and experiment.

shams2023 avatar Mar 13 '24 09:03 shams2023