diffusers
SD3 and Gradient checkpointing gives error and crashes
Describe the bug
Activating --gradient_checkpointing in either the LoRA or DreamBooth training scripts for SD3 causes `TypeError: layer_norm(): argument 'input' (position 1) must be Tensor, not tuple`, which crashes the run. Without the flag, LoRA training runs fine at about 20 GB of VRAM with batch size 1 and AdamW8bit.
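For context, this kind of error typically appears when a checkpointed block returns a tuple and the raw return value is fed into a norm layer. A minimal standalone sketch (the `Block` class and shapes are illustrative, not actual diffusers code):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Hypothetical transformer-style block whose forward returns a tuple
# (hidden_states, extra), as many diffusion-model blocks do.
class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return self.linear(x), None  # tuple output

norm = nn.LayerNorm(8)
block = Block(8)
x = torch.randn(2, 8, requires_grad=True)

# checkpoint() passes the tuple through unchanged; feeding it straight
# into LayerNorm reproduces the reported TypeError.
out = checkpoint(block, x, use_reentrant=False)
try:
    norm(out)  # out is a tuple, not a Tensor
except TypeError as e:
    print(type(e).__name__)

# Unpacking the tensor before the norm avoids the crash (a sketch of
# the general shape of the fix, not the actual PR diff):
hidden, _ = out
y = norm(hidden)
```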
Reproduction
Add --gradient_checkpointing to training parameters.
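A reproduction command along these lines (model path, data directory, and prompt are illustrative placeholders; the flags follow the diffusers DreamBooth LoRA SD3 example script, with --gradient_checkpointing being the addition that triggers the crash):

```shell
accelerate launch train_dreambooth_lora_sd3.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-3-medium-diffusers" \
  --instance_data_dir="./dog" \
  --instance_prompt="a photo of sks dog" \
  --output_dir="./sd3-lora" \
  --train_batch_size=1 \
  --use_8bit_adam \
  --gradient_checkpointing
```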
Logs
No response
System Info
- 🤗 Diffusers version: 0.29.0.dev0
- Platform: Windows-10-10.0.19045-SP0
- Running on a notebook?: No
- Running on Google Colab?: No
- Python version: 3.10.11
- PyTorch version (GPU?): 2.2.1+cu118 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.23.3
- Transformers version: 4.41.2
- Accelerate version: 0.31.0
- PEFT version: 0.11.1
- Bitsandbytes version: 0.43.0
- Safetensors version: 0.4.2
- xFormers version: not installed
- Accelerator: NVIDIA GeForce RTX 3090 (24576 MiB VRAM); NVIDIA GeForce RTX 4090 (24564 MiB VRAM)
- Using GPU in script?: RTX 4090
- Using distributed or parallel set-up in script?: No DDP or similar parallel setups.
Who can help?
No response
I wish I'd looked sooner, haha. I was hunting this one down.
@sayakpaul @DN6 i can confirm this one
Can confirm this error happens with --gradient_checkpointing during LoRA training.
diffusers 0.29.0
I have fixed this here: https://github.com/huggingface/diffusers/pull/8542
Since #8542 was merged, can we close this?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Closing this since #8542 seems like the fix and due to inactivity to @DN6's question. If the issue still persists, please LMK and re-open this so we can work on it asap