
OOM when training WAN2.2 and Qwen Image

Open rachidlamouchi opened this issue 1 month ago • 2 comments

This is for bugs only

Did you already ask in the discord?

No

You verified that this is a bug and not a feature request or question by asking in the discord?

Yes

Describe the bug

Hi, I should mention that I'm not a developer and I don't know anything about the command line.

For the past few months, following an update, it's no longer possible to train WAN2.2 or Qwen Image; training fails with the errors below. I've tried everything with ChatGPT and Gemini, but I've never succeeded. Today I saw an update and thought it would fix the issue, but the problem persists. It occurs even with a 512-resolution training run at a batch size of 1. I should also clarify that this isn't a VRAM issue, as I can train a Flux at 1024 with a batch size of 2 without any problems.

For WAN2.2: CUDA error: out of memory. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

For Qwen Image: CUDA out of memory. Tried to allocate 2.03 GiB. GPU 0 has a total capacity of 31.84 GiB of which 0 bytes is free. Of the allocated memory 29.69 GiB is allocated by PyTorch, and 138.42 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
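Both error messages themselves point at environment variables worth trying before the next run. A minimal sketch of setting them, assuming a bash-style shell; the variable names and values come straight from the errors above, not from the ai-toolkit docs, and they are diagnostics/mitigations rather than a guaranteed fix:

```shell
# On Windows cmd (the "easy installation" .bat scripts), use `set NAME=value`
# instead of `export` before launching the trainer.

# Reduce fragmentation in the PyTorch CUDA allocator
# (suggested by the Qwen Image error):
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Make CUDA errors surface at the failing call so the stack trace is accurate
# (suggested by the WAN2.2 error):
export CUDA_LAUNCH_BLOCKING=1
```

With CUDA_LAUNCH_BLOCKING=1 the WAN2.2 stack trace should point at the actual failing operation, which would make this report easier to debug.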

I'm using the "easy installation" version.

Windows 11, Core i9-12900K, RTX 5090, 64 GB RAM

rachidlamouchi · Nov 10 '25 12:11