
Attempting to unscale FP16 gradients

Open: cian0 opened this issue 1 year ago • 1 comment

Describe the bug

The training script fails before completing the first training step, raising the error in the title.

Reproduction

No response

Logs

Steps:   0%|                                                                                       | 0/800 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/workspace/sdw/examples/dreambooth/train_dreambooth.py", line 812, in <module>
    main(args)
  File "/workspace/sdw/examples/dreambooth/train_dreambooth.py", line 784, in main
    optimizer.step()
  File "/opt/conda/lib/python3.9/site-packages/accelerate/optimizer.py", line 134, in step
    self.scaler.step(self.optimizer, closure)
  File "/opt/conda/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 337, in step
    self.unscale_(optimizer)
  File "/opt/conda/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 282, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/opt/conda/lib/python3.9/site-packages/torch/cuda/amp/grad_scaler.py", line 210, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
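
In the traceback, `GradScaler._unscale_grads_` raises the error. PyTorch does this when a parameter handed to the optimizer has a float16 gradient, which usually means the trainable weights themselves are stored in FP16. Whether that is the cause here is not confirmed in this report. As a minimal diagnostic sketch, the hypothetical helper below (`find_fp16_trainable_params` is not part of diffusers or accelerate) lists trainable parameters held in float16. A small `nn.Linear` stands in for the model being trained.

```python
import torch
from torch import nn


def find_fp16_trainable_params(model: nn.Module) -> list[str]:
    """Return the names of trainable parameters stored in float16."""
    return [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and param.dtype == torch.float16
    ]


# Stand-in for the model being trained (an assumption for illustration,
# not the DreamBooth UNet from the script above).
model = nn.Linear(4, 2).half()
print(find_fp16_trainable_params(model))  # ['weight', 'bias']

# After casting the trainable weights back to float32, nothing is flagged.
model = model.float()
print(find_fp16_trainable_params(model))  # []
```

If the helper reports any names for the model passed to the optimizer, keeping those weights in float32 and leaving the FP16 casting to autocast is one thing to try.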

System Info

my pip list: absl-py 1.3.0 accelerate 0.14.0 aiohttp 3.8.3 aiosignal 1.2.0 anyio 3.6.2 argon2-cffi 21.3.0 argon2-cffi-bindings 21.2.0 asttokens 2.0.5 astunparse 1.6.3 async-timeout 4.0.2 attrs 22.1.0 awscli 1.27.8 Babel 2.11.0 backcall 0.2.0 bash_kernel 0.8.0 bcrypt 4.0.1 beautifulsoup4 4.11.1 bitsandbytes 0.35.4 bleach 5.0.1 botocore 1.29.8 brotlipy 0.7.0 cachetools 5.2.0 certifi 2022.9.24 cffi 1.15.0 chardet 4.0.0 charset-normalizer 2.0.4 click 8.1.3 cmake 3.24.3 colorama 0.4.4 conda 22.9.0 conda-build 3.22.0 conda-content-trust 0+unknown conda-package-handling 1.8.1 contourpy 1.0.6 cryptography 36.0.0 cycler 0.11.0 debugpy 1.6.3 decorator 5.1.1 defusedxml 0.7.1 diffusers 0.8.0.dev0 docutils 0.16 entrypoints 0.4 exceptiongroup 1.0.0 executing 0.8.3 expecttest 0.1.4 fastapi 0.86.0 fastjsonschema 2.16.2 ffmpy 0.3.0 filelock 3.6.0 fonttools 4.38.0 frozenlist 1.3.1 fsspec 2022.10.0 ftfy 6.1.1 future 0.18.2 glob2 0.7 google-auth 2.14.1 google-auth-oauthlib 0.4.6 gradio 3.9 grpcio 1.50.0 h11 0.12.0 httpcore 0.15.0 httpx 0.23.0 huggingface-hub 0.10.1 hypothesis 6.56.4 idna 3.3 importlib-metadata 5.0.0 iniconfig 1.1.1 ipykernel 6.17.1 ipython 8.4.0 ipython-genutils 0.2.0 ipywidgets 8.0.2 jedi 0.18.1 Jinja2 3.1.2 jmespath 1.0.1 json5 0.9.10 jsonschema 4.17.0 jupyter 1.0.0 jupyter-archive 3.3.2 jupyter_client 7.4.5 jupyter-console 6.4.4 jupyter_core 5.0.0 jupyter-http-over-ws 0.0.8 jupyter-server 1.23.2 jupyterlab 3.5.0 jupyterlab-pygments 0.2.2 jupyterlab_server 2.16.3 jupyterlab-widgets 3.0.3 kiwisolver 1.4.4 libarchive-c 2.9 linkify-it-py 1.0.3 Markdown 3.4.1 markdown-it-py 2.1.0 MarkupSafe 2.1.1 matplotlib 3.6.2 matplotlib-inline 0.1.6 mdit-py-plugins 0.3.1 mdurl 0.1.2 mistune 2.0.4 mkl-fft 1.3.1 mkl-random 1.2.2 mkl-service 2.4.0 modelcards 0.1.6 mpmath 1.2.1 multidict 6.0.2 mypy-extensions 0.4.3 natsort 8.2.0 nbclassic 0.4.8 nbclient 0.7.0 nbconvert 7.2.5 nbformat 5.7.0 nbzip 0.1.0 nest-asyncio 1.5.6 notebook 6.5.2 notebook_shim 0.2.2 numpy 1.22.3 oauthlib 3.2.2 
orjson 3.8.1 packaging 21.3 pandas 1.5.1 pandocfilters 1.5.0 paramiko 2.12.0 parso 0.8.3 pexpect 4.8.0 pickleshare 0.7.5 Pillow 9.0.1 pip 21.2.4 pkginfo 1.8.3 platformdirs 2.5.4 pluggy 1.0.0 prometheus-client 0.15.0 prompt-toolkit 3.0.20 protobuf 3.20.3 psutil 5.8.0 ptyprocess 0.7.0 pure-eval 0.2.2 pyasn1 0.4.8 pyasn1-modules 0.2.8 pycosat 0.6.3 pycparser 2.21 pycryptodome 3.15.0 pydantic 1.10.2 pydub 0.25.1 Pygments 2.11.2 PyNaCl 1.5.0 pyOpenSSL 22.0.0 pyparsing 3.0.9 pyre-extensions 0.0.23 pyrsistent 0.19.2 PySocks 1.7.1 pytest 7.2.0 python-dateutil 2.8.2 python-multipart 0.0.5 pytz 2022.1 PyYAML 5.4.1 pyzmq 24.0.1 qtconsole 5.4.0 QtPy 2.3.0 regex 2022.10.31 requests 2.27.1 requests-oauthlib 1.3.1 rfc3986 1.5.0 rsa 4.7.2 ruamel-yaml-conda 0.15.100 s3transfer 0.6.0 Send2Trash 1.8.0 setuptools 61.2.0 six 1.16.0 sniffio 1.3.0 sortedcontainers 2.4.0 soupsieve 2.3.2.post1 stack-data 0.2.0 starlette 0.20.4 sympy 1.11.1 tensorboard 2.11.0 tensorboard-data-server 0.6.1 tensorboard-plugin-wit 1.8.1 terminado 0.17.0 tinycss2 1.2.1 tokenizers 0.13.1 toml 0.10.2 tomli 2.0.1 toolz 0.11.2 torch 1.13.0 torchtext 0.14.0 torchvision 0.14.0 tornado 6.2 tqdm 4.63.0 traitlets 5.5.0 transformers 4.24.0 triton 2.0.0.dev20221105 types-dataclasses 0.6.6 typing_extensions 4.4.0 typing-inspect 0.8.0 uc-micro-py 1.0.1 urllib3 1.26.8 uvicorn 0.19.0 wcwidth 0.2.5 webencodings 0.5.1 websocket-client 1.4.2 websockets 10.4 Werkzeug 2.2.2 wheel 0.37.1 widgetsnbextension 4.0.3 xformers 0.0.14.dev0 yarl 1.8.1 zipp 3.10.0

I've tried on vast.ai with these machines:

  • RTX 3090, CUDA 11.4
  • A6000, CUDA 11.7

  • diffusers version: 0.8.0.dev0
  • Platform: Linux-5.4.0-81-generic-x86_64-with-glibc2.27
  • Python version: 3.9.12
  • PyTorch version (GPU?): 1.13.0 (True)
  • Huggingface_hub version: 0.10.1
  • Transformers version: 4.24.0
  • Using GPU in script?: RTX 3090/A6000 on vast.ai
  • Using distributed or parallel set-up in script?: NO

cian0, Nov 14 '22 17:11