Training speed degrades significantly while GPU power draw and temperature drop
This is for bugs only
Did you already ask in the discord?
Yes
You verified that this is a bug and not a feature request or question by asking in the discord?
Yes
Describe the bug
Configuration: Windows 11 25H2, Python 3.12.10, PyTorch 2.8.0, latest version of ai-toolkit (commit 9b89bab)
Hardware: 1x RTX 3060 12 GB (Driver Version: 581.15, CUDA Version: 13.0), 64 GB DDR4 RAM
Training config:
job: extension
config:
  name: test128
  process:
    - type: diffusion_trainer
      training_folder: C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\output
      sqlite_db_path: C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\aitk_db.db
      device: cuda
      trigger_word: null
      performance_log_every: 10
      network:
        type: lora
        linear: 128
        linear_alpha: 128
        conv: 16
        conv_alpha: 16
        lokr_full_rank: true
        lokr_factor: -1
        network_kwargs:
          ignore_if_contains: []
      save:
        dtype: bf16
        save_every: 250
        max_step_saves_to_keep: 4
        save_format: diffusers
        push_to_hub: false
      datasets:
        - folder_path: C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\datasets/target
          mask_path: null
          mask_min_value: 0.1
          default_caption: put design on the shirt
          caption_ext: txt
          caption_dropout_rate: 0.05
          cache_latents_to_disk: false
          is_reg: false
          network_weight: 1
          resolution:
            - 512
            - 768
            - 1024
          controls: []
          shrink_video_to_frames: true
          num_frames: 1
          do_i2v: true
          flip_x: false
          flip_y: false
          control_path_1: C:\Users\Mekni\Desktop\hybridaitoolkit\AI-Toolkit\datasets/control
          control_path_2: null
          control_path_3: null
      train:
        batch_size: 1
        bypass_guidance_embedding: false
        steps: 7000
        gradient_accumulation: 1
        train_unet: true
        train_text_encoder: false
        gradient_checkpointing: true
        noise_scheduler: flowmatch
        optimizer: adamw8bit
        timestep_type: weighted
        content_or_style: balanced
        optimizer_params:
          weight_decay: 0.0001
        unload_text_encoder: false
        cache_text_embeddings: true
        lr: 0.0001
        ema_config:
          use_ema: false
          ema_decay: 0.99
        skip_first_sample: true
        force_first_sample: false
        disable_sampling: true
        dtype: bf16
        diff_output_preservation: false
        diff_output_preservation_multiplier: 1
        diff_output_preservation_class: person
        switch_boundary_every: 1
        loss_type: mse
      model:
        name_or_path: Qwen/Qwen-Image-Edit-2509
        quantize: true
        qtype: uint3|ostris/accuracy_recovery_adapters/qwen_image_edit_2509_torchao_uint3.safetensors
        quantize_te: true
        qtype_te: qfloat8
        arch: qwen_image_edit_plus
        low_vram: true
        model_kwargs:
          match_target_res: true
        layer_offloading: true
        layer_offloading_text_encoder_percent: 1
        layer_offloading_transformer_percent: 1
      sample:
        sampler: flowmatch
        sample_every: 250
        width: 1024
        height: 1024
        samples: []
        neg: ''
        seed: 42
        walk_seed: true
        guidance_scale: 4
        sample_steps: 25
        num_frames: 1
        fps: 1
meta:
  name: test128
  version: '1.0'
Issue: When training a rank 128 Qwen Edit 2509 LoRA, training speed becomes highly inconsistent and slows down dramatically after a few epochs. Initially it runs at around 56 seconds per iteration, but after some time it degrades to ~200 seconds per iteration or more. A fresh install does not fix it. Training a LoRA at rank 64 or lower works fine.
During training I observe:
- No out-of-memory (OOM) errors
- No SSD swap usage
- GPU load remains at around 100%
- VRAM and RAM usage stay constant
However, during this slowdown:
- GPU power draw and temperature drop sharply, indicating reduced actual compute utilization even though usage metrics report around 100%
- Training speed slows down dramatically, making progress extremely inefficient
Typical training behavior:
- ~55 °C GPU temperature
- ~115 W power draw
Problematic behavior:
- Noticeable drop in both temperature and power draw
- GPU still reports around 100% utilization
- Dramatic slowdown in training performance
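To capture what the card is actually doing when this happens, here is a minimal monitoring sketch (not part of ai-toolkit; it assumes the `nvidia-ml-py` / `pynvml` package and a single GPU at index 0). Run it in a second terminal alongside training: it polls NVML for the SM clock, power draw, temperature, reported utilization, and the driver's clock-throttle-reason bitmask, so the slow phase can be correlated with real clock behavior instead of relying on the 100% utilization figure alone.

```python
# Hypothetical diagnostic helper, not part of ai-toolkit.
# Requires: pip install nvidia-ml-py (imported as pynvml).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # single RTX 3060 assumed at index 0

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)        # .gpu / .memory, percent
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # NVML reports milliwatts
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        sm_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        throttle = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)  # bitmask
        print(f"util={util.gpu}% power={power_w:.0f}W temp={temp_c}C "
              f"sm_clock={sm_mhz}MHz throttle_reasons=0x{throttle:x}")
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()
```

During the slow phase this should show whether the SM clock is genuinely dropping (and which throttle reason the driver reports), or whether the clocks stay up and the GPU is mostly waiting (e.g. on offloaded weights) while still being counted as 100% utilized.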
Related issue: https://github.com/ostris/ai-toolkit/issues/390