Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
I managed to install the previous version and trained for a few days. I had excellent results.
Now, using the same settings, I can no longer start training. I'm using an RTX 5090. I also reinstalled it on two other computers, one with an RTX 4090 and another with an RTX 3090. The same error happens on all of them, regardless of the number of training images for Wan 2.2 or the configuration.
Does anyone know how to fix this? I haven’t modified anything. I believe the executable file may have caused AITOOLKIT to run and update itself, creating some incompatibility. I’ve reinstalled it several times, but I keep getting the same error.
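If you suspect a silent self-update, one quick check before looking at the log below is to see which commit the checkout is actually on. A minimal sketch, assuming the install is the git clone at C:\AITOOLKIT\AI-Toolkit shown in the log:

```
:: Show the currently checked-out commit, its date, and its subject line
cd C:\AITOOLKIT\AI-Toolkit
git log -1 --format="%h %ad %s" --date=short
```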
```
Running 1 job
W1009 22:44:38.061000 48028 venv\Lib\site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
{
  "type": "diffusion_trainer",
  "training_folder": "C:\AITOOLKIT\AI-Toolkit\output",
  "sqlite_db_path": "C:\AITOOLKIT\AI-Toolkit\aitk_db.db",
  "device": "cuda",
  "trigger_word": "brunnosarttori",
  "performance_log_every": 10,
  "network": {
    "type": "lora",
    "linear": 32,
    "linear_alpha": 32,
    "conv": 16,
    "conv_alpha": 16,
    "lokr_full_rank": true,
    "lokr_factor": -1,
    "network_kwargs": { "ignore_if_contains": [] }
  },
  "save": {
    "dtype": "bf16",
    "save_every": 250,
    "max_step_saves_to_keep": 4,
    "save_format": "diffusers",
    "push_to_hub": false
  },
  "datasets": [
    {
      "folder_path": "C:\AITOOLKIT\AI-Toolkit\datasets/dsdfsdf",
      "mask_path": null,
      "mask_min_value": 0.1,
      "default_caption": "",
      "caption_ext": "txt",
      "caption_dropout_rate": 0.05,
      "cache_latents_to_disk": false,
      "is_reg": false,
      "network_weight": 1,
      "resolution": [ 512, 768, 1024 ],
      "controls": [],
      "shrink_video_to_frames": true,
      "num_frames": 1,
      "do_i2v": true,
      "flip_x": false,
      "flip_y": false
    }
  ],
  "train": {
    "batch_size": 1,
    "bypass_guidance_embedding": false,
    "steps": 3000,
    "gradient_accumulation": 1,
    "train_unet": true,
    "train_text_encoder": false,
    "gradient_checkpointing": true,
    "noise_scheduler": "flowmatch",
    "optimizer": "adamw8bit",
    "timestep_type": "linear",
    "content_or_style": "balanced",
    "optimizer_params": { "weight_decay": 0.0001 },
    "unload_text_encoder": false,
    "cache_text_embeddings": false,
    "lr": 0.0001,
    "ema_config": { "use_ema": false, "ema_decay": 0.99 },
    "skip_first_sample": false,
    "force_first_sample": false,
    "disable_sampling": false,
    "dtype": "bf16",
    "diff_output_preservation": false,
    "diff_output_preservation_multiplier": 1,
    "diff_output_preservation_class": "person",
    "switch_boundary_every": 1,
    "loss_type": "mse"
  },
  "model": {
    "name_or_path": "ai-toolkit/Wan2.2-T2V-A14B-Diffusers-bf16",
    "quantize": true,
    "qtype": "qfloat8",
    "quantize_te": true,
    "qtype_te": "qfloat8",
    "arch": "wan22_14b:t2v",
    "low_vram": true,
    "model_kwargs": { "train_high_noise": false, "train_low_noise": true }
  },
  "sample": {
    "sampler": "flowmatch",
    "sample_every": 250,
    "width": 1024,
    "height": 1024,
    "samples": [
      { "prompt": "brunnosarttori with red hair, playing chess at the park, bomb going off in the background" }
    ],
    "neg": "",
    "seed": 42,
    "walk_seed": true,
    "guidance_scale": 4,
    "sample_steps": 25,
    "num_frames": 1,
    "fps": 16
  }
}
Using SQLite database at C:\AITOOLKIT\AI-Toolkit\aitk_db.db
Job ID: "6b06af2e-bb35-4360-add9-4ca0b12c3e42"
#############################################
Running job: my_first_lora_v1
#############################################
Running 1 process
Loading Wan model
Loading transformer 1
Loading checkpoint shards: 100%|#########################################################| 3/3 [00:00<00:00, 20.69it/s]
Quantizing Transformer 1
- quantizing 40 transformer blocks
100%|##################################################################################| 40/40 [00:45<00:00, 1.14s/it]
- quantizing extras
Moving transformer 1 to CPU
Loading transformer 2
Loading checkpoint shards: 100%|#########################################################| 3/3 [00:00<00:00, 5.64it/s]
Quantizing Transformer 2
- quantizing 40 transformer blocks
55%|#############################################1 | 22/40 [00:44<00:36, 2.01s/it]
Error running job: CUDA error: out of memory
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
========================================
Result:
- 0 completed jobs
- 1 failure
========================================
Traceback (most recent call last):
File "C:\AITOOLKIT\AI-Toolkit\run.py", line 120, in
main() File "C:\AITOOLKIT\AI-Toolkit\run.py", line 108, in main raise e File "C:\AITOOLKIT\AI-Toolkit\run.py", line 96, in main job.run() File "C:\AITOOLKIT\AI-Toolkit\jobs\ExtensionJob.py", line 22, in run process.run() File "C:\AITOOLKIT\AI-Toolkit\jobs\process\BaseSDTrainProcess.py", line 1564, in run self.sd.load_model() File "C:\AITOOLKIT\AI-Toolkit\extensions_built_in\diffusion_models\wan22\wan22_14b_model.py", line 227, in load_model super().load_model() File "C:\AITOOLKIT\AI-Toolkit\toolkit\models\wan21\wan21.py", line 401, in load_model transformer = self.load_wan_transformer( File "C:\AITOOLKIT\AI-Toolkit\extensions_built_in\diffusion_models\wan22\wan22_14b_model.py", line 326, in load_wan_transformer quantize_model(self, transformer_2) File "C:\AITOOLKIT\AI-Toolkit\toolkit\util\quantize.py", line 308, in quantize_model block.to("cpu", non_blocking=True) File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\torch\nn\modules\module.py", line 1369, in to return self._apply(convert) File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\torch\nn\modules\module.py", line 928, in _apply module._apply(fn) File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\torch\nn\modules\module.py", line 928, in _apply module._apply(fn) File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\torch\nn\modules\module.py", line 928, in _apply module._apply(fn) File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\torch\nn\modules\module.py", line 955, in _apply param_applied = fn(param) File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\torch\nn\modules\module.py", line 1355, in convert return t.to( File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\optimum\quanto\tensor\qtensor.py", line 93, in torch_function return func(*args, **kwargs) File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\optimum\quanto\tensor\qbytes.py", line 130, in torch_dispatch return qdispatch(*args, **kwargs) File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\optimum\quanto\tensor\qbytes_ops.py", line 64, in _to_copy out_data = op(t._data, dtype=t._data.dtype, **kwargs) File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\torch_ops.py", line 1243, in call return self._op(*args, **kwargs) torch.AcceleratorError: CUDA error: out of memory Compile with TORCH_USE_CUDA_DSAto enable device-side assertions.
C:\AITOOLKIT\AI-Toolkit>
```
Same issue. Something's broken.
Same here. My last training, around one week ago, was fine with similar parameters. Now CPU memory (64 GB) is not enough with 8-bit quantization during Transformer 2. I managed to start training by using 4-bit with ARA for transformer quantization rather than 8-bit.
Update 1: My last training was October 5th, and this was first reported on the 9th, so some commit between those dates is to blame.
Update 2: I have rolled back to commit c6edd71 (Oct 1st) and it works fine. I have not tested further commits so far.
Update 3: Commit 4e57078 (Oct 5th 2025) is the one that introduced the problem.
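For anyone who wants to narrow a regression like this down themselves, here is a rough sketch of the bisect between the known-good and known-bad commits (it assumes a plain git clone and that you re-run a short training job as the test at each step):

```
:: Let git walk the commits between the last good and first bad versions
cd C:\AITOOLKIT\AI-Toolkit
git bisect start
git bisect bad 4e57078
git bisect good c6edd71
:: After each checkout: reinstall requirements if they changed, run a short job,
:: then report the outcome with "git bisect good" or "git bisect bad".
:: When done, return to your original branch:
git bisect reset
```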
Thanks @jmgmcm, rolling back to the Oct 1st commit fixed the CUDA out-of-memory issues that started for me last week. In my case it was qwen-image training on a 4070 Ti Super 16 GB. I had tried all the new memory offload features, but the only workaround was lower quants. With the rollback I can run 8-bit again, and it runs faster.
Same here, OOM.
```
Quantizing Transformer 2
- quantizing 40 transformer blocks
48%|####7 | 19/40 [01:29<01:39, 4.72s/it]
Error running job: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
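The message above suggests CUDA_LAUNCH_BLOCKING=1 if you want the stack trace to point at the call that actually fails. A sketch for a single run on Windows (the config path here is only an example, not the real file name):

```
:: Force synchronous CUDA calls so the error surfaces at the failing operation
set CUDA_LAUNCH_BLOCKING=1
:: Launch the job as usual (example config path)
python run.py config\my_first_lora_v1.yaml
```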
Same here, I've been having issues with this for two weeks training LoRAs on Wan2.2 on a 5090. Jobs that used to run flawlessly now OOM.
```
CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```
Since updating AI-Toolkit today, I can no longer create a Wan2.2 14B LoRA; it always OOMs (same settings as before) ... RTX 5090, 32 GB VRAM, 128 GB RAM.
Same here, constant OOM now; it was running smoothly before updating.
By jmgmcm: Update 2: I have rolled back to commit https://github.com/ostris/ai-toolkit/commit/c6edd71a5bb36f3dffcc8b56ee07cacaee14ab56 (Oct 1st) and it works fine. I have not tested further commits so far.
That worked for me :) big thanks
Yeah, but how do you do that with, for example, a GeForce 5060 when you need ramtorch??? :)
By jmgmcm: Update 2: I have rolled back to commit c6edd71 (Oct 1st) and it works fine. I have not tested further commits so far.
That worked for me :) big thanks
Useful. After switching versions, you may need to reinstall the Python dependencies and then rebuild the WebUI:
```
git reset --hard c6edd71
pip install -r requirements.txt
cd ui
npm run build_and_start
```
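Once a fixed version lands, a sketch for moving back to the current main (assumes the checkout tracks origin/main and that you are happy to discard the pinned state):

```
git fetch origin
git reset --hard origin/main
pip install -r requirements.txt
cd ui
npm run build_and_start
```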
For me, this was actually a different issue: bad settings when trying to train a LoRA. I wish the developer would disable incompatible settings for each type of training.
I also had this problem when starting training in QWEN; it would just keep thinking but nothing would happen.
The problem with reverting to a previous version is that we don't get the new features. Hopefully, the developer can fix this issue.
Add me to the list. I started using Ostris's AI-Toolkit in late October and get this error every time the low VRAM flag is enabled, using a 5090 for Wan 2.2 14B.
T2V works fine without low VRAM but hits a CUDA OOM when it is enabled. I2V fails at anything above 256, so there is no option but to enable low VRAM, which then crashes.
The current version (26f4f02) works fine again (64 GB RAM, RTX 5090). Thank you to the developers.