Training speed degrades significantly while GPU power draw and temperature drop
This is for bugs only
Did you already ask in the discord?
Yes
You verified that this is a bug and not a feature request or question by asking in the discord?
Yes
Describe the bug
Configuration: Windows 11 25H2, Python 3.12.10, PyTorch 2.8.0, latest version of ai-toolkit (commit 9b89bab)
Hardware: 1x RTX 3060 12 GB (Driver Version: 581.15, CUDA Version: 13.0), 64 GB DDR4 RAM
Training config:
job: extension
config:
  name: test128
  process:
    - type: diffusion_trainer
      training_folder: C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\output
      sqlite_db_path: C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\aitk_db.db
      device: cuda
      trigger_word: null
      performance_log_every: 10
      network:
        type: lora
        linear: 128
        linear_alpha: 128
        conv: 16
        conv_alpha: 16
        lokr_full_rank: true
        lokr_factor: -1
        network_kwargs:
          ignore_if_contains: []
      save:
        dtype: bf16
        save_every: 250
        max_step_saves_to_keep: 4
        save_format: diffusers
        push_to_hub: false
      datasets:
        - folder_path: C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\datasets/target
          mask_path: null
          mask_min_value: 0.1
          default_caption: put design on the shirt
          caption_ext: txt
          caption_dropout_rate: 0.05
          cache_latents_to_disk: false
          is_reg: false
          network_weight: 1
          resolution:
            - 512
            - 768
            - 1024
          controls: []
          shrink_video_to_frames: true
          num_frames: 1
          do_i2v: true
          flip_x: false
          flip_y: false
          control_path_1: C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\datasets/control
          control_path_2: null
          control_path_3: null
      train:
        batch_size: 1
        bypass_guidance_embedding: false
        steps: 7000
        gradient_accumulation: 1
        train_unet: true
        train_text_encoder: false
        gradient_checkpointing: true
        noise_scheduler: flowmatch
        optimizer: adamw8bit
        timestep_type: weighted
        content_or_style: balanced
        optimizer_params:
          weight_decay: 0.0001
        unload_text_encoder: false
        cache_text_embeddings: true
        lr: 0.0001
        ema_config:
          use_ema: false
          ema_decay: 0.99
        skip_first_sample: true
        force_first_sample: false
        disable_sampling: true
        dtype: bf16
        diff_output_preservation: false
        diff_output_preservation_multiplier: 1
        diff_output_preservation_class: person
        switch_boundary_every: 1
        loss_type: mse
      model:
        name_or_path: Qwen/Qwen-Image-Edit-2509
        quantize: true
        qtype: uint3|ostris/accuracy_recovery_adapters/qwen_image_edit_2509_torchao_uint3.safetensors
        quantize_te: true
        qtype_te: qfloat8
        arch: qwen_image_edit_plus
        low_vram: true
        model_kwargs:
          match_target_res: true
        layer_offloading: true
        layer_offloading_text_encoder_percent: 1
        layer_offloading_transformer_percent: 1
      sample:
        sampler: flowmatch
        sample_every: 250
        width: 1024
        height: 1024
        samples: []
        neg: ''
        seed: 42
        walk_seed: true
        guidance_scale: 4
        sample_steps: 25
        num_frames: 1
        fps: 1
meta:
  name: test128
  version: '1.0'
Issue: When training a rank 128 Qwen Edit 2509 LoRA, training speed becomes highly inconsistent and slows down dramatically after a few epochs. Initially it runs at around 56 seconds per iteration, but after some time it degrades to ~200 seconds per iteration or more. A fresh install does not fix it. Training a LoRA at rank 64 or lower works fine.
During training I observe:
- No out-of-memory (OOM) errors
- No SSD swap usage
- GPU load remains at around 100%
- VRAM and RAM usage stay constant
However, during this slowdown:
- GPU power draw and temperature drop sharply, indicating reduced actual compute utilization even though usage metrics report around 100%
- Training speed slows down dramatically, making progress extremely inefficient
Typical training behavior:
- ~55 °C GPU temperature
- ~115 W power draw
Problematic behavior:
- Noticeable drop in both temperature and power draw
- GPU still reports around 100% utilization
- Dramatic slowdown in training performance
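To capture what the card is actually doing when this happens, here is a minimal monitoring sketch (not part of ai-toolkit; it assumes the `nvidia-ml-py` / `pynvml` package and a single GPU at index 0). Run it in a second terminal alongside training: it polls NVML for the SM clock, power draw, temperature, reported utilization, and the driver's clock-throttle-reason bitmask, so the slow phase can be correlated with real clock behavior instead of relying on the 100% utilization figure alone.

```python
# Hypothetical diagnostic helper, not part of ai-toolkit.
# Requires: pip install nvidia-ml-py (imported as pynvml).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # single RTX 3060 assumed at index 0

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)        # .gpu / .memory, percent
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # NVML reports milliwatts
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        sm_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        throttle = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)  # bitmask
        print(f"util={util.gpu}% power={power_w:.0f}W temp={temp_c}C "
              f"sm_clock={sm_mhz}MHz throttle_reasons=0x{throttle:x}")
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()
```

During the slow phase this should show whether the SM clock is genuinely dropping (and which throttle reason the driver reports), or whether the clocks stay up and the GPU is mostly waiting (e.g. on offloaded weights) while still being counted as 100% utilized.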
Related issue: https://github.com/ostris/ai-toolkit/issues/390