DiffSynth-Studio

CPU memory explosion during LoRA training — all training processes duplicate full model weights (no shared memory / lazy loading)

Open · KAI4816 opened this issue 3 months ago · 1 comment

Hi team,

I encountered a severe CPU memory overflow when fine-tuning a LoRA on the Qwen-Image-Edit-2509 model using the default examples/qwen_image/model_training/train.py script.

💻 Environment

- OS: Ubuntu 22.04
- GPUs: 1 × RTX A6000 (48 GB) + 3 × RTX 3090 (24 GB each)
- CPU RAM: 62 GB
- CUDA: 12.9
- PyTorch: 2.5.1+cu124
- Accelerate: 1.11.0
- DeepSpeed: disabled (also tested ZeRO-2, same issue)

🧩 Problem summary

When running LoRA fine-tuning (e.g., Qwen-Image-Edit-2509), each distributed process independently loads a full copy of all 5 safetensors model shards, the text encoder, and the VAE, even though all processes run on the same node.

With 4 GPUs (num_processes=4), that means roughly 5 shards × ~25 GB each × 4 processes ≈ 500 GB of CPU memory requested.
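The estimate multiplies out as follows (a quick sanity check using the figures reported in this issue):

```python
# Back-of-envelope check using the figures reported in this issue:
# 5 safetensors shards of ~25 GB each, fully duplicated by each of
# the 4 distributed processes, versus 62 GB of system RAM.
shards = 5
gb_per_shard = 25
processes = 4

total_gb = shards * gb_per_shard * processes
print(total_gb)        # 500 GB requested in total
print(total_gb > 62)   # True: far beyond the available system RAM
```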

Since the machine only has 62 GB of system RAM, all 4 processes are killed (SIGKILL) before the first step. This happens even while GPU VRAM stays below 10 GB, confirming the crash occurs during model initialization on the CPU.

⚙️ Training command (launcher invocation omitted):

```shell
  --dataset_base_path dataset \
  --dataset_metadata_path dataset/squat_edit.jsonl \
  --data_file_keys input,output \
  --extra_inputs edit \
  --train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --max_pixels 1048576 \
  --dataset_repeat 2 \
  --model_paths '[
    "DiffSynth-Studio/models/Qwen/Qwen-Image-Edit-2509/transformer/diffusion_pytorch_model-00001-of-00005.safetensors",
    ...
  ]' \
  --learning_rate 1e-4 \
  --num_epochs 1 \
  --lora_base_model dit \
  --lora_target_modules to_q,to_k,to_v,to_out.0,to_add_out,img_mlp.net.2,img_mod.1,txt_mlp.net.2,txt_mod.1 \
  --lora_rank 8 \
  --use_gradient_checkpointing \
  --dataset_num_workers 1 \
  --find_unused_parameters
```

📉 Behavior observed

- As soon as the 4 processes are spawned, each starts loading the full model weights independently.
- CPU RAM rises from ~5 GB to 62 GB within seconds, then all processes are killed (exit code -9).
- No GPU OOM; no batch iteration is started.
- Inference with the same model works fine (single process).
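One way to confirm the crash is CPU-side is to log each process's peak RSS around model loading. A minimal stdlib sketch (a hypothetical helper, not part of DiffSynth-Studio; note that ru_maxrss is reported in kilobytes on Linux but bytes on macOS):

```python
# Hedged sketch: a minimal per-process peak-RSS logger one could call
# before and after model loading to confirm the SIGKILLs are CPU-side,
# not a GPU OOM. Stdlib only; ru_maxrss is in kilobytes on Linux.
import os
import resource

def log_peak_rss(tag: str = "") -> float:
    """Print and return this process's peak resident set size in GB."""
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    peak_gb = peak_kb / 1024 ** 2
    print(f"[pid {os.getpid()}] {tag} peak RSS: {peak_gb:.2f} GB")
    return peak_gb
```

Calling this from each rank right before the safetensors shards are read, and again afterwards, should show the ~15 GB-per-process jump described above.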

KAI4816 avatar Nov 06 '25 14:11 KAI4816

@KAI4816

  1. First of all, Qwen-Image LoRA training requires 80 GB of GPU memory, which exceeds the capacity of your GPUs, so training cannot proceed on this hardware.
  2. Secondly, we will fix the CPU loading issue and plan to release a major update within one month.
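For readers hitting this before the fix lands: the general shape of the fix is to materialize the weights once and let every worker map the same physical memory instead of copying it. A minimal sketch of that pattern using the stdlib's multiprocessing.shared_memory (illustrative only; DiffSynth-Studio's actual fix may differ):

```python
# Illustrative pattern only: one copy of a large payload lives in
# shared memory, and every worker attaches to it rather than
# duplicating it, so total CPU RAM stays near one model's footprint.
from multiprocessing import Process, Queue, shared_memory

def worker(shm_name: str, size: int, results: Queue) -> None:
    # Attaching maps the parent's pages; no per-process copy is made.
    shm = shared_memory.SharedMemory(name=shm_name)
    results.put(sum(shm.buf[:size]))  # touch the data to prove access
    shm.close()

def demo(num_workers: int = 4) -> list:
    payload = bytes(range(256)) * 4  # tiny stand-in for model weights
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    results: Queue = Queue()
    procs = [
        Process(target=worker, args=(shm.name, len(payload), results))
        for _ in range(num_workers)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    out = [results.get() for _ in procs]
    shm.close()
    shm.unlink()
    return out
```

With real checkpoints the same effect can come from memory-mapped loading (e.g., mmap-backed tensor files), so the OS page cache serves all ranks from one physical copy.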

Artiprocher avatar Nov 11 '25 03:11 Artiprocher