
OOM with 32GB 5090 when trying Qwen-Image training

Open NeedForSpeed73 opened this issue 1 month ago • 11 comments

This is for bugs only

Did you already ask in the discord?

Yes

You verified that this is a bug and not a feature request or question by asking in the discord?

Yes

Describe the bug

When trying to replicate the workflow in this official Ostris YouTube channel video, I get an OOM error before the actual training starts, while it is only generating samples. According to another Discord help request, the problem seems to be common and to have appeared in the last few weeks (did some update break it?).

I'm running Ubuntu 24.04 (kernel 6.14.0-33), Python 3.13.9, nvidia-driver-580-open (CUDA 13.0) and PyTorch 2.9.0+cu128.

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.06 GiB. GPU 0 has a total capacity of 31.36 GiB of which 336.62 MiB is free. Including non-PyTorch memory, this process has 30.68 GiB memory in use. Of the allocated memory 27.17 GiB is allocated by PyTorch, and 2.92 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
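
For reference, a minimal sketch of applying the allocator hint from the traceback before launching the job. The config path is a placeholder for your own job file, and whether this actually avoids the OOM here is untested:

# Export the allocator hint quoted in the error message, then launch as usual.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python run.py config/your_qwen_image_lora.yml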

NeedForSpeed73 avatar Oct 26 '25 12:10 NeedForSpeed73

Yes, it seems so. Look at this comment: https://github.com/ostris/ai-toolkit/issues/457#issuecomment-3393679558

kesslerdev avatar Oct 26 '25 21:10 kesslerdev

Yes, it seems so. Look at this comment: #457 (comment)

I've tried the suggested workaround but I still get the exact same error.

What I did was: git reset --hard c6edd71

then I launched the ui again:

cd ui
npm run build_and_start

I double-checked the changed Python source files to be sure I was actually on the older commit, and they are indeed the old versions.
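
A quick way to confirm the checkout actually landed on the old commit before rebuilding (sketch; the comments describe expected output, not captured logs):

# HEAD should start with c6edd71 after the hard reset
git rev-parse --short HEAD
# the working tree should show no local modifications
git status
cd ui
npm run build_and_start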

NeedForSpeed73 avatar Oct 27 '25 14:10 NeedForSpeed73

I'm having the same issue. I'm running this on Arch Linux in a container (specifically CachyOS, if that makes any difference). It was working fine on Windows 11 when I was using the automatic installer.

I'm using the Dockerfile located in this git repo.

Error running job: CUDA out of memory. Tried to allocate 4.45 GiB. GPU 0 has a total capacity of 31.35 GiB of which 1.61 GiB is free. Including non-PyTorch memory, this process has 28.64 GiB memory in use. Of the allocated memory 22.70 GiB is allocated by PyTorch, and 5.34 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I have RTX 5090

Also tried to revert to c6edd71 build but still getting the same issue.

NeoLoger avatar Oct 28 '25 21:10 NeoLoger

Like others, I have tried different commits and run into similar issues. I have an old working install with newer commits than many people recommended, so I don't think it's directly an ai-toolkit issue; the difference is that my old virtual environment still worked without OOM errors.

CUDA has updated to 13.0 on Arch. Using the recommended install command you get the unsupported sm_* architecture errors for the 50 series, which drove me to install the latest PyTorch for the 50 series. That doesn't work and was causing OOM errors for me.

Installing these versions of torch for 12.9 specifically seems to fix my OOM issues on any commit.

pip3 install --no-cache-dir torch==2.8.0+cu129 torchvision==0.23.0+cu129 torchaudio==2.8.0+cu129 --index-url https://download.pytorch.org/whl/cu129
pip3 install -r requirements.txt
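
A small sanity check after installing the cu129 wheels, before re-running training (sketch; the values in the comments are what these pins should report):

# should print 2.8.0+cu129 and 12.9
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
# should name the RTX 5090 and report compute capability (12, 0)
python3 -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"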

I hope this helps someone!

Nynxz avatar Nov 04 '25 07:11 Nynxz

@Nynxz Thank you! Can confirm, running perfectly without any OOM issues!

I'm running it inside a Docker container from the git repo; all I did was change one line based on your comment. From:

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt && \
    pip install --pre --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128 --force && \
    pip install setuptools==69.5.1 --no-cache-dir

To:

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt && \
    pip install --no-cache-dir torch==2.8.0+cu129 torchvision==0.23.0+cu129 torchaudio==2.8.0+cu129 --index-url https://download.pytorch.org/whl/cu129 --force && \
    pip install setuptools==69.5.1 --no-cache-dir
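
After editing the Dockerfile, the image has to be rebuilt for the change to take effect. A rough sketch, assuming the compose file from this repo is what builds the image locally:

# rebuild without cache so the new torch wheels are actually pulled
docker compose build --no-cache
# recreate the container from the rebuilt image
docker compose up -d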

NeoLoger avatar Nov 04 '25 12:11 NeoLoger

@Nynxz Thanks. That was the solution for me as well. Btw, I'm also running the Qwen training under Xfce with Microsoft Edge to keep the desktop as light as possible.

NeedForSpeed73 avatar Nov 06 '25 21:11 NeedForSpeed73

Exact same boat: CachyOS, running it as a Docker container. I don't understand where that command gets placed, though. Did you create a custom Dockerfile to use it, or does it go somewhere else? I'm using the compose file from the git repo to build my container, but I'm still fairly new and not sure how to fix this.

battousaifader avatar Nov 16 '25 23:11 battousaifader

On the latest version, I can only train Wan 2.2 T2V with the transformer at 6-bit. Anything above that and I get OOM.

I could train with float8 by reverting to the commit mentioned above, plus using the torch build for CUDA 12.9.

Quite annoying to lose the queue feature as it was added after that commit (maybe I can cherry-pick that one), but hopefully it gets fixed soon.

[Update 1]: Interestingly enough, I made 2 changes that worked for me on the latest version:

1 - ~~Applied these changes from @relaxis: https://github.com/ostris/ai-toolkit/commit/5e5e9dbd53773bb99241a3bf9320afaff77944e7#diff-79b334e98ad31800d9bdfdd9a036f5a358427d97db94b8c1f9e50ffd11da380bL2200-R2204~~

2 - Set "low_vram: false" in the config

Then I could train back Wan 2.2 T2V with float8 transformer successfully.

[Update 2]: Step 1 above is not really needed. It works only with low_vram set to false.
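
For reference, a minimal sketch of where that flag lives, assuming the usual ai-toolkit job layout where low_vram sits under the process's model block; the model path and surrounding keys are placeholders, not my actual config:

config:
  process:
    - type: 'sd_trainer'
      model:
        name_or_path: "Wan-AI/Wan2.2-T2V-A14B"   # placeholder
        quantize: true
        low_vram: false   # the change that avoided the OOM here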

De-Zoomer avatar Nov 21 '25 19:11 De-Zoomer

Because my changes require CUDA 13 and PyTorch nightly (October 2025).

relaxis avatar Nov 21 '25 20:11 relaxis

I tried the solution from @Nynxz, but unfortunately I still run OOM on the latest commit. I am on Windows 11, though. The only thing working for me is rolling back to commit c6edd71.

Herojayjay avatar Nov 30 '25 16:11 Herojayjay

Well there’s your problem.

relaxis avatar Nov 30 '25 18:11 relaxis