
SD3 and Gradient checkpointing gives error and crashes

Open bluvoll opened this issue 1 year ago • 5 comments

Describe the bug

Enabling --gradient_checkpointing in either the LoRA or DB (DreamBooth) training script for SD3 raises TypeError: layer_norm(): argument 'input' (position 1) must be Tensor, not tuple, which crashes the run. Without the flag, LoRA training runs fine at about 20 GB of VRAM with batch size 1 and AdamW8bit.
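Editor's sketch: one plausible way this class of error arises is a block that returns two streams as a tuple while the checkpointed code path stores that tuple in a single variable, so the next normalization call receives a tuple instead of a tensor. The snippet below is a dependency-free illustration of that pattern only. `Tensor`, `layer_norm`, `checkpoint`, `joint_block`, `forward_buggy`, and `forward_fixed` are all hypothetical stand-ins written for this note, not the diffusers or PyTorch code, and the thread does not confirm that this is the exact cause.

```python
# Dependency-free sketch. Every name here is a hypothetical stand-in,
# NOT the real torch / diffusers implementation.

class Tensor:
    """Stand-in for torch.Tensor: wraps a list of floats."""
    def __init__(self, values):
        self.values = list(values)


def layer_norm(x):
    """Stand-in for layer_norm: rejects anything that is not a Tensor."""
    if not isinstance(x, Tensor):
        raise TypeError(
            "layer_norm(): argument 'input' (position 1) must be Tensor, "
            f"not {type(x).__name__}"
        )
    mean = sum(x.values) / len(x.values)
    var = sum((v - mean) ** 2 for v in x.values) / len(x.values)
    return Tensor([(v - mean) / (var + 1e-5) ** 0.5 for v in x.values])


def joint_block(hidden, context):
    """Hypothetical block that returns TWO streams as a tuple."""
    return layer_norm(context), layer_norm(hidden)


def checkpoint(fn, *args):
    """Stand-in for activation checkpointing: passes fn's return value through."""
    return fn(*args)


def forward_buggy(hidden, context, num_blocks=2):
    for _ in range(num_blocks):
        # Bug pattern: the (context, hidden) tuple lands in one variable,
        # so the next block hands a tuple to layer_norm.
        hidden = checkpoint(joint_block, hidden, context)
    return hidden


def forward_fixed(hidden, context, num_blocks=2):
    for _ in range(num_blocks):
        # Unpack both outputs, as a non-checkpointed path would.
        context, hidden = checkpoint(joint_block, hidden, context)
    return hidden
```

Under these assumptions, calling forward_buggy with two Tensor inputs fails on the second block with the same "must be Tensor, not tuple" message, while forward_fixed returns a Tensor.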

Reproduction

Add --gradient_checkpointing to training parameters.

Logs

No response

System Info

  • 🤗 Diffusers version: 0.29.0.dev0
  • Platform: Windows-10-10.0.19045-SP0
  • Running on a notebook?: No
  • Running on Google Colab?: No
  • Python version: 3.10.11
  • PyTorch version (GPU?): 2.2.1+cu118 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.23.3
  • Transformers version: 4.41.2
  • Accelerate version: 0.31.0
  • PEFT version: 0.11.1
  • Bitsandbytes version: 0.43.0
  • Safetensors version: 0.4.2
  • xFormers version: not installed
  • Accelerator: NVIDIA GeForce RTX 3090, 24576 MiB VRAM; NVIDIA GeForce RTX 4090, 24564 MiB VRAM
  • Using GPU in script?: RTX 4090
  • Using distributed or parallel set-up in script?: No DDP or similar parallel setups.

Who can help?

No response

bluvoll Jun 13 '24 00:06

i wish i'd looked sooner, haha. i was hunting this one down.

bghira Jun 13 '24 01:06

@sayakpaul @DN6 i can confirm this one

bghira Jun 13 '24 01:06

Can confirm this error happens with --gradient_checkpointing in LoRA training.

diffusers 0.29.0

rockerBOO Jun 13 '24 01:06

I have fixed this here: https://github.com/huggingface/diffusers/pull/8542

RefractAI Jun 13 '24 21:06

Since #8542 was merged, can we close this?

DN6 Jul 01 '24 10:07

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] Sep 14 '24 15:09

Closing this since #8542 seems to be the fix and @DN6's question has gone unanswered. If the issue still persists, please let me know and re-open this so we can work on it as soon as possible.

a-r-r-o-w Nov 18 '24 20:11