
Qwen image training on 5090 running at slow speed then speeding up after the first sampling.

Open Hakim3i opened this issue 4 months ago • 19 comments

Qwen image training on 5090 running at slow speed then speeding up after the first sampling.

0_log.txt

Image

Hakim3i avatar Aug 24 '25 14:08 Hakim3i

I got the same problem.

capruokz avatar Aug 26 '25 00:08 capruokz

I'm having the same issue. I noticed in Task Manager that VRAM is already using shared memory, which might be the reason for the low performance. I think some settings need to be configured to reduce VRAM usage.

66sama avatar Aug 26 '25 07:08 66sama

I'm having the same issue. I noticed in Task Manager that VRAM is already using shared memory, which might be the reason for the low performance. I think some settings need to be configured to reduce VRAM usage.

I set the transformer to 3bit with LoRA, enabled cache text embeddings and cache latents, and now the performance is normal.
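For anyone looking for the equivalent config keys: below is a sketch of the settings described here, in ai-toolkit's YAML config format. `cache_text_embeddings` and `cache_latents_to_disk` appear in configs later in this thread; the exact 3-bit `qtype` identifier varies by version, so the value below is an assumption. Pick the real one from the UI or your version's docs.

```yaml
model:
  name_or_path: "Qwen/Qwen-Image"
  arch: "qwen_image"
  quantize: true                # quantize the transformer so it fits in VRAM
  qtype: "uint3"                # 3-bit weights (assumed identifier; check your version)
train:
  cache_text_embeddings: true   # encode captions once instead of keeping the TE resident
datasets:
  - folder_path: "/path/to/dataset"
    cache_latents_to_disk: true # pre-encode images with the VAE, once
```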

Image

66sama avatar Aug 26 '25 08:08 66sama

Hm! I only have a 5070 Ti, but I run into the same issue. I have 71.5 W out of 300.0 W power drawn. I used 3-bit with LoRA, enabled cache text embeddings and cache latents, and even lowered the resolution to 768... is my card too puny for Qwen? :o

Frytkownica avatar Sep 05 '25 15:09 Frytkownica

Yeah, on the 5090 I am also getting really fast speeds, such as 3 s/it, only for it to slow significantly to 70-90 s/it with low power draw, which is unusual. I am wondering if it is a power management option. I did install the Studio drivers, so I doubt it could be a driver issue.

CodeDog04 avatar Sep 09 '25 01:09 CodeDog04

UPDATE: I didn't figure it out. I hope this information helps someone else in an attempt to get this issue resolved.

In my BIOS I changed these settings (I have three GPUs, train off the 5090, and was having the same issues):

Native ASPM → Disabled
CPU PCIE ASPM Mode Control → Disabled
PCIEX16_1 Link Mode → GEN4
PCIEX16_2 Link Mode → GEN4
PCIEX16(G4) Link Mode → GEN4

I set the PCIe slots to GEN4; they were defaulting to GEN5. I'm unsure which of these settings did it, perhaps all of them. I ran into no issues and trained perfectly on the 5090 for a while, averaging 2-5 it/s with no slowdown, but after a few thousand iterations it slows again and the issue persists. I will have to do more work to see whether the built-in options could be a cause.

I am going to test the max step saves options some more. It appears that it still slows to a near halt if you save too often and have a big backup of saved steps.
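For reference, the checkpoint settings being tested here map to these config keys (names as in the config dump later in this thread; the values are a sketch based on the numbers suggested in this discussion):

```yaml
save:
  save_every: 1000            # checkpoint interval in steps
  max_step_saves_to_keep: 4   # prune older checkpoints; a big backlog of
                              # saved steps correlated with the slowdown
```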

CodeDog04 avatar Sep 10 '25 05:09 CodeDog04

I figured it out! At least for me, I 100% figured it out this time. I hope this information helps someone else; it took a lot of messing with settings.

In my BIOS I changed these settings (I have three GPUs, train off the 5090, and was having the same issues):

Native ASPM → Disabled
CPU PCIE ASPM Mode Control → Disabled
PCIEX16_1 Link Mode → GEN4
PCIEX16_2 Link Mode → GEN4
PCIEX16(G4) Link Mode → GEN4

I set the PCIe slots to GEN4; they were defaulting to GEN5. I'm unsure which of these settings did it, perhaps all of them. I run into no issues and train perfectly on the 5090 now, averaging 2-5 it/s with no slowdown.

It seems you should keep it at save every 1000 and have a max step saves of around 4. It appears that it still slows to a near halt if you save too often and have a big backlog of saved steps.

The problem is that it is using shared memory where it should not; that's why it is slowing down. This also happens with Wan2.2. I don't think it is related to BIOS or drivers; I'll have to test other trainers to confirm.

Hakim3i avatar Sep 10 '25 06:09 Hakim3i

I figured it out! At least for me, I 100% figured it out this time. I hope this information helps someone else; it took a lot of messing with settings.

In my BIOS I changed these settings (I have three GPUs, train off the 5090, and was having the same issues):

Native ASPM → Disabled
CPU PCIE ASPM Mode Control → Disabled
PCIEX16_1 Link Mode → GEN4
PCIEX16_2 Link Mode → GEN4
PCIEX16(G4) Link Mode → GEN4

I set the PCIe slots to GEN4; they were defaulting to GEN5. I'm unsure which of these settings did it, perhaps all of them. I run into no issues and train perfectly on the 5090 now, averaging 2-5 it/s with no slowdown. It seems you should keep it at save every 1000 and have a max step saves of around 4. It appears that it still slows to a near halt if you save too often and have a big backlog of saved steps.

The problem is that it is using shared memory where it should not; that's why it is slowing down. This also happens with Wan2.2. I don't think it is related to BIOS or drivers; I'll have to test other trainers to confirm.

I spoke too soon and the issue is back for me. Strange; this is driving me insane. I suppose your finding is the only reasonable conclusion to this issue. I will give another trainer a try to see if I also run into this problem there. I am only attempting to train a Qwen LoRA.

CodeDog04 avatar Sep 10 '25 06:09 CodeDog04

Hi, I’m experiencing the same issue on a 4090. It takes around 30–40 minutes just to start the first step, and then performance improves a bit, but power usage remains very unstable. Sometimes it’s fine, but other times it drops a lot, so the estimated training time jumps from ~5 hours up to 20 hours.

bsalberto77 avatar Sep 10 '25 20:09 bsalberto77

I have this same issue on my 5090 when training Qwen image loras. It starts out strong before dropping the power draw and slowing down dramatically.

daflood avatar Sep 12 '25 18:09 daflood

From what I have been able to understand, not using 100% of power is normal. Not all parts of the card are exercised by this workload, so power consumption should not be maxed out.

Frytkownica avatar Sep 12 '25 19:09 Frytkownica

Right, it's not always going to be maxed out, but I'm only seeing 33 s/it on a 5090. Shouldn't it be closer to 3.5?

daflood avatar Sep 12 '25 20:09 daflood

From what I have been able to understand, not using 100% of power is normal. Not all parts of the card are exercised by this workload, so power consumption should not be maxed out.

A 5090 should not be drawing only ~100 W anyway. This issue has nothing to do with power draw; it is a problem with overflowing into shared memory. The trainer should not be using shared memory when the GPU is not OOM.

The low power draw happens when the trainer uses shared memory.

Qwen starts very slow because it uses shared memory; then, after the first sampling is done, it speeds up because it stops using shared memory, which is a very weird issue.
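The spillover diagnosis can be checked while training runs. Below is a minimal sketch that polls `nvidia-smi` and flags when dedicated VRAM is close enough to full that the Windows driver's system-memory fallback is likely to kick in; the 97% threshold and the helper names are assumptions for illustration, not part of any trainer.

```python
import re
import subprocess

def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse 'used, total' MiB values from a nvidia-smi CSV line like '31000, 32607'."""
    used, total = (int(x) for x in re.findall(r"\d+", line)[:2])
    return used, total

def vram_nearly_full(used_mb: int, total_mb: int, threshold: float = 0.97) -> bool:
    """Heuristic: once dedicated VRAM is nearly full, the driver may start
    spilling allocations to shared system memory, which tanks throughput."""
    return used_mb / total_mb >= threshold

try:
    # One live sample; requires an NVIDIA driver with nvidia-smi on PATH.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()[0]
except (FileNotFoundError, subprocess.CalledProcessError, IndexError):
    out = "31000, 32607"  # fallback sample line so the sketch runs anywhere

used, total = parse_memory_line(out)
print(f"VRAM {used}/{total} MiB, spill risk: {vram_nearly_full(used, total)}")
```

On recent Windows drivers there is also a per-application "CUDA - Sysmem Fallback Policy" option in the NVIDIA Control Panel that can be set to prefer no system-memory fallback, so an overflow fails fast instead of silently slowing down.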

Hakim3i avatar Sep 12 '25 21:09 Hakim3i

I saw on the Discord that you can set the trainer to save more often, which clears out the memory. I set it to every 75 steps from 250, and it seems to keep Windows from sharing the memory.

daflood avatar Sep 12 '25 23:09 daflood

@CodeDog04 bro, were you using multiple GPUs? I have 2x 5090, and I modified the line "device: cuda:0, cuda:1" to train qwen-image-edit, but it didn't work, even when I wanted to use only cuda:1. Could you tell me how to modify the config file to enable multi-GPU? Thanks!

matrix12315 avatar Sep 25 '25 09:09 matrix12315

The same problem occurs on the RTX 4090. At first the process runs quickly, but then it slows to a near standstill. Instead of taking 3 hours, it ends up taking 30.

pastuh avatar Sep 25 '25 09:09 pastuh

@CodeDog04 bro, were you using multiple GPUs? I have 2x 5090, and I modified the line "device: cuda:0, cuda:1" to train qwen-image-edit, but it didn't work, even when I wanted to use only cuda:1. Could you tell me how to modify the config file to enable multi-GPU? Thanks!

I found the only solution was changing the settings to save every 50 steps, with max step saves to keep set to 50 (change it to whatever you want).

CodeDog04 avatar Sep 26 '25 02:09 CodeDog04

Same issue here on a 5090; I could not get it working for almost 2 days. I think a bad combination of settings breaks the performance. My current settings are below, running at 5-7 s/it without incident for 4000 steps right now. For me it was the switch from 5-bit quantization to 6-bit that broke the whole thing. Hope it helps someone...

    "type": "lora",
    "linear": 64,
    "linear_alpha": 64,
    "conv": 16,
    "conv_alpha": 16,

    "dtype": "bf16",
    "save_every": 1000,
    "max_step_saves_to_keep": 20,
    "save_format": "diffusers",

        "mask_min_value": 0.1,
        "default_caption": "",
        "caption_ext": "txt",
        "caption_dropout_rate": 0.05,
        "cache_latents_to_disk": true,
        "is_reg": false,
        "network_weight": 1,
        "resolution": [
            1280,
            1024
        ],

    "batch_size": 1,
    "gradient_accumulation": 1,
    "steps": 15000,
    "train_unet": true,
    "train_text_encoder": false,
    "gradient_checkpointing": true,
    "noise_scheduler": "flowmatch",
    "timestep_type": "shift",
    "optimizer": "adamw8bit",
    "optimizer_params": {
        "weight_decay": 0.0001
    },
    "bypass_guidance_embedding": false,
    "content_or_style": "content",
    "unload_text_encoder": false,
    "cache_text_embeddings": true,
    "lr": 0.0001,
    "ema_config": {
        "use_ema": false,
        "ema_decay": 0.999
    },
    "skip_first_sample": true,
    "force_first_sample": false,
    "disable_sampling": false,
    "dtype": "bf16",
    "loss_type": "mse"
Image

LiquefyR avatar Sep 28 '25 22:09 LiquefyR

For Qwen-Image, none of the above methods worked for me. To be clear, the 5090 trains very fast, but if you run sample images, for example one round every 500 steps, sampling keeps occupying part of the VRAM afterwards, which makes training speed drop sharply, with power going from ~500 W down to ~150 W. The solution: extend the sampling interval and don't interrupt in between, for example sample once every 3000 steps; after 3000 steps, increase the step count again and resume training. The LoRA save interval does not affect performance; only sampling does.
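The workaround above (stretching the sampling interval) maps to a single key in the sample section of the config; a sketch, with the interval from the comment:

```yaml
sample:
  sample_every: 3000   # was e.g. 500; sample as rarely as possible, since each
                       # sampling pass kept VRAM occupied afterwards
# alternatively, "disable_sampling": true (seen in the config dump above)
# skips sampling entirely
```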

hero8152 avatar Oct 16 '25 06:10 hero8152