
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Open · brazilgithub opened this issue 2 months ago • 15 comments

I managed to install the previous version and trained for a few days. I had excellent results.

Now, using the same settings, I can no longer start training. I'm using an RTX 5090. I also reinstalled it on two other computers, one with an RTX 4090 and another with an RTX 3090. The same error happens on all of them when training Wan 2.2, regardless of the number of training images or the configuration.

Does anyone know how to fix this? I haven't modified anything. I believe launching the executable may have caused AI Toolkit to update itself, creating some incompatibility. I've reinstalled it several times, but I keep getting the same error.

```
Running 1 job
W1009 22:44:38.061000 48028 venv\Lib\site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
{
  "type": "diffusion_trainer",
  "training_folder": "C:\AITOOLKIT\AI-Toolkit\output",
  "sqlite_db_path": "C:\AITOOLKIT\AI-Toolkit\aitk_db.db",
  "device": "cuda",
  "trigger_word": "brunnosarttori",
  "performance_log_every": 10,
  "network": { "type": "lora", "linear": 32, "linear_alpha": 32, "conv": 16, "conv_alpha": 16, "lokr_full_rank": true, "lokr_factor": -1, "network_kwargs": { "ignore_if_contains": [] } },
  "save": { "dtype": "bf16", "save_every": 250, "max_step_saves_to_keep": 4, "save_format": "diffusers", "push_to_hub": false },
  "datasets": [ { "folder_path": "C:\AITOOLKIT\AI-Toolkit\datasets/dsdfsdf", "mask_path": null, "mask_min_value": 0.1, "default_caption": "", "caption_ext": "txt", "caption_dropout_rate": 0.05, "cache_latents_to_disk": false, "is_reg": false, "network_weight": 1, "resolution": [ 512, 768, 1024 ], "controls": [], "shrink_video_to_frames": true, "num_frames": 1, "do_i2v": true, "flip_x": false, "flip_y": false } ],
  "train": { "batch_size": 1, "bypass_guidance_embedding": false, "steps": 3000, "gradient_accumulation": 1, "train_unet": true, "train_text_encoder": false, "gradient_checkpointing": true, "noise_scheduler": "flowmatch", "optimizer": "adamw8bit", "timestep_type": "linear", "content_or_style": "balanced", "optimizer_params": { "weight_decay": 0.0001 }, "unload_text_encoder": false, "cache_text_embeddings": false, "lr": 0.0001, "ema_config": { "use_ema": false, "ema_decay": 0.99 }, "skip_first_sample": false, "force_first_sample": false, "disable_sampling": false, "dtype": "bf16", "diff_output_preservation": false, "diff_output_preservation_multiplier": 1, "diff_output_preservation_class": "person", "switch_boundary_every": 1, "loss_type": "mse" },
  "model": { "name_or_path": "ai-toolkit/Wan2.2-T2V-A14B-Diffusers-bf16", "quantize": true, "qtype": "qfloat8", "quantize_te": true, "qtype_te": "qfloat8", "arch": "wan22_14b:t2v", "low_vram": true, "model_kwargs": { "train_high_noise": false, "train_low_noise": true } },
  "sample": { "sampler": "flowmatch", "sample_every": 250, "width": 1024, "height": 1024, "samples": [ { "prompt": "brunnosarttori with red hair, playing chess at the park, bomb going off in the background" } ], "neg": "", "seed": 42, "walk_seed": true, "guidance_scale": 4, "sample_steps": 25, "num_frames": 1, "fps": 16 }
}
Using SQLite database at C:\AITOOLKIT\AI-Toolkit\aitk_db.db
Job ID: "6b06af2e-bb35-4360-add9-4ca0b12c3e42"

#############################################
Running job: my_first_lora_v1
#############################################

Running 1 process
Loading Wan model
Loading transformer 1
Loading checkpoint shards: 100%|#########################################################| 3/3 [00:00<00:00, 20.69it/s]
Quantizing Transformer 1
 - quantizing 40 transformer blocks 100%|##################################################################################| 40/40 [00:45<00:00, 1.14s/it]
 - quantizing extras
Moving transformer 1 to CPU
Loading transformer 2
Loading checkpoint shards: 100%|#########################################################| 3/3 [00:00<00:00, 5.64it/s]
Quantizing Transformer 2
 - quantizing 40 transformer blocks  55%|#############################################1 | 22/40 [00:44<00:36, 2.01s/it]
Error running job: CUDA error: out of memory
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

========================================
Result:
 - 0 completed jobs
 - 1 failure
========================================
Traceback (most recent call last):
  File "C:\AITOOLKIT\AI-Toolkit\run.py", line 120, in <module>
    main()
  File "C:\AITOOLKIT\AI-Toolkit\run.py", line 108, in main
    raise e
  File "C:\AITOOLKIT\AI-Toolkit\run.py", line 96, in main
    job.run()
  File "C:\AITOOLKIT\AI-Toolkit\jobs\ExtensionJob.py", line 22, in run
    process.run()
  File "C:\AITOOLKIT\AI-Toolkit\jobs\process\BaseSDTrainProcess.py", line 1564, in run
    self.sd.load_model()
  File "C:\AITOOLKIT\AI-Toolkit\extensions_built_in\diffusion_models\wan22\wan22_14b_model.py", line 227, in load_model
    super().load_model()
  File "C:\AITOOLKIT\AI-Toolkit\toolkit\models\wan21\wan21.py", line 401, in load_model
    transformer = self.load_wan_transformer(
  File "C:\AITOOLKIT\AI-Toolkit\extensions_built_in\diffusion_models\wan22\wan22_14b_model.py", line 326, in load_wan_transformer
    quantize_model(self, transformer_2)
  File "C:\AITOOLKIT\AI-Toolkit\toolkit\util\quantize.py", line 308, in quantize_model
    block.to("cpu", non_blocking=True)
  File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\torch\nn\modules\module.py", line 1369, in to
    return self._apply(convert)
  File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\torch\nn\modules\module.py", line 928, in _apply
    module._apply(fn)
  File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\torch\nn\modules\module.py", line 928, in _apply
    module._apply(fn)
  File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\torch\nn\modules\module.py", line 928, in _apply
    module._apply(fn)
  File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\torch\nn\modules\module.py", line 955, in _apply
    param_applied = fn(param)
  File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\torch\nn\modules\module.py", line 1355, in convert
    return t.to(
  File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\optimum\quanto\tensor\qtensor.py", line 93, in __torch_function__
    return func(*args, **kwargs)
  File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\optimum\quanto\tensor\qbytes.py", line 130, in __torch_dispatch__
    return qdispatch(*args, **kwargs)
  File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\optimum\quanto\tensor\qbytes_ops.py", line 64, in _to_copy
    out_data = op(t._data, dtype=t._data.dtype, **kwargs)
  File "C:\AITOOLKIT\AI-Toolkit\venv\lib\site-packages\torch\_ops.py", line 1243, in __call__
    return self._op(*args, **kwargs)
torch.AcceleratorError: CUDA error: out of memory
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

C:\AITOOLKIT\AI-Toolkit>
```
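For what it's worth, the traceback shows the OOM is raised while a freshly quantized block is being moved off the GPU (`block.to("cpu", non_blocking=True)` in `toolkit/util/quantize.py`), i.e. during the quantize-and-offload pass for Transformer 2, not during the training loop itself. If you want to watch VRAM climb while that pass runs, polling `nvidia-smi` from a second terminal works; this is a generic driver-tool command, not part of AI Toolkit, and assumes `nvidia-smi` is on your PATH:

```
# Report GPU memory usage once per second while the quantization pass runs.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```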

brazilgithub · Oct 10 '25 01:10

Same issue. Something's broken.

lycos2019 · Oct 10 '25 20:10

Same here. My last training, about a week ago, ran fine with similar parameters. Now CPU memory (64 GB) is not enough with 8-bit quantization during Transformer 2. I managed to start training by switching the transformer quantization to 4-bit with ARA rather than 8-bit.

Update 1: My last successful training was October 5th, and this was first reported on the 9th, so some commit between those dates is to blame.

Update 2: I have rolled back to commit c6edd71 (Oct 1st) and it works fine. I haven't tested any later commits so far.

Update 3: Commit 4e57078 (Oct 5th, 2025) is the one that introduced the problem.
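For anyone who wants to confirm the offending commit on their own hardware rather than taking my word for it, a git bisect between the last known-good commit and the current checkout is the quickest check. A minimal sketch, assuming a plain git clone of ai-toolkit and that you can re-run a short training job at each step (the hashes are the ones mentioned above):

```
git bisect start
git bisect bad               # mark the current (failing) checkout as bad
git bisect good c6edd71      # Oct 1st commit that still works
# git now checks out a commit in between: run a short training job,
# then mark the result with `git bisect good` or `git bisect bad`
git bisect reset             # return to your original checkout when done
```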

jmgmcm · Oct 11 '25 21:10

Thanks @jmgmcm, rolling back to the Oct 1st commit fixed the CUDA out-of-memory issues that started last week. In my case it was qwen-image training on a 4070 Ti Super 16 GB. I had tried all the new memory offload features, but the only workaround was lower quants. With the rollback I can run 8-bit again, and it runs faster.

pcl04dl3tt3r · Oct 13 '25 06:10

Same here, OOM.

```
Quantizing Transformer 2
 - quantizing 40 transformer blocks
 48%|####7     | 19/40 [01:29<01:39,  4.72s/it]
Error running job: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
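As the message says, `CUDA_LAUNCH_BLOCKING=1` only makes the reported stack trace point at the real failing call; it does not avoid the OOM. A rough sketch of setting it for a run, assuming you launch the job from a terminal (bash syntax shown; on Windows cmd use `set` instead of `export`, and the config path is just a placeholder for your own job file):

```
# Synchronous CUDA launches give accurate stack traces; this does not fix the OOM itself.
export CUDA_LAUNCH_BLOCKING=1
python run.py config/your_job.yaml   # placeholder path, substitute your own config
```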

DefinitlyEvil · Oct 16 '25 22:10

Same here. I've been having issues with this for two weeks training LoRAs on Wan 2.2 on a 5090. Jobs that used to run flawlessly now OOM.

```
CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

Jensen144 · Oct 19 '25 09:10

Since updating AI Toolkit today, I can no longer create a Wan 2.2 14B LoRA: always OOM with the same settings as before. RTX 5090 (32 GB VRAM), 128 GB RAM.

Syscore64 · Oct 21 '25 03:10

Same here, constant OOM now; it was running smoothly before updating.

Galvo87 · Oct 21 '25 05:10

> By jmgmcm, Update 2: I have rolled back to commit https://github.com/ostris/ai-toolkit/commit/c6edd71a5bb36f3dffcc8b56ee07cacaee14ab56 (Oct 1st) and it works fine. I haven't tested any later commits so far.

That worked for me :) big thanks

Syscore64 · Oct 21 '25 06:10

Yeah, but how do you do that with, e.g., a GeForce 5060 when you need ramtorch??? :)

cooperdk · Oct 30 '25 18:10

> By jmgmcm, Update 2: I have rolled back to commit c6edd71 (Oct 1st) and it works fine. I haven't tested any later commits so far.
>
> That worked for me :) big thanks

Useful. After switching versions, you may need to reinstall the Python dependencies and then rebuild the web UI:

```
git reset --hard c6edd71
pip install -r requirements.txt

cd ui
npm run build_and_start
```
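A slightly fuller variant, offered as a sketch rather than a confirmed recipe: fetch first in case your local clone predates the target commit, stash any local edits you want to keep, and refresh the UI packages in case `ui/package.json` changed between the two versions.

```
git fetch origin
git stash                     # only if you have local edits worth keeping
git reset --hard c6edd71
pip install -r requirements.txt
cd ui
npm install                   # precautionary: UI dependencies may have changed
npm run build_and_start
```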

mailzwj · Nov 03 '25 13:11

For me, this was actually a different issue: bad settings when trying to train a LoRA. I wish the developer would disable incompatible settings for each specific type of training.

cooperdk · Nov 03 '25 15:11

I also had this problem when starting training with Qwen; it would just keep thinking, but nothing would happen.

The problem with reverting to a previous version is that we don't get the new features. Hopefully, the developer can fix this issue.

Antonio-Luis-LDM · Nov 08 '25 11:11

Add me to the list. I started using Ostris's AI Toolkit in late October and get this error every time the low VRAM flag is enabled, using a 5090 for Wan 2.2 14B.

T2V works fine without low VRAM but hits a CUDA OOM when it is enabled. I2V fails on anything higher than 256, so there is no option but to enable low VRAM, which then crashes.

99-bolt · Nov 09 '25 22:11

The current version (26f4f02) works fine again (64 GB RAM, RTX 5090). Thank you to the developers.

jmgmcm · Dec 13 '25 16:12