CPU memory explosion during LoRA training — all training processes duplicate full model weights (no shared memory / lazy loading)
Hi team,
I ran into a severe CPU out-of-memory issue when LoRA fine-tuning the Qwen-Image-Edit-2509 model with the default examples/qwen_image/model_training/train.py script.
💻 Environment
- OS: Ubuntu 22.04
- GPUs: 1 × RTX A6000 (48 GB) + 3 × RTX 3090 (24 GB each)
- CPU RAM: 62 GB
- CUDA: 12.9
- PyTorch: 2.5.1+cu124
- Accelerate: 1.11.0
- DeepSpeed: disabled (also tested ZeRO-2, same issue)
🧩 Problem summary
When running LoRA fine-tuning (e.g., on Qwen-Image-Edit-2509), each distributed process independently loads a full copy of all 5 safetensors model shards, plus the text encoder and VAE, even though all processes run on the same node.
With 4 GPUs (num_processes=4), that means total model memory ≈ 5 shards × ~25 GB per shard × 4 processes ≈ 500 GB of CPU memory allocated.
Since the machine only has 62 GB of system RAM, all 4 processes are killed (SIGKILL) before the first training step. This happens even while GPU VRAM stays below 10 GB, confirming the crash occurs during model initialization on the CPU.
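The back-of-envelope estimate above, reproduced in Python (shard size and process count are the figures from this report, not measured values):

```python
# Rough CPU-RAM estimate when every distributed process independently
# loads all checkpoint shards. Numbers are from this report.
num_shards = 5        # diffusion_pytorch_model-0000X-of-00005.safetensors
gb_per_shard = 25     # approximate size per shard
num_processes = 4     # one process per GPU (num_processes=4)
system_ram_gb = 62

peak_gb = num_shards * gb_per_shard * num_processes
print(f"estimated peak CPU RAM: ~{peak_gb} GB")                       # ~500 GB
print(f"fits in {system_ram_gb} GB of system RAM: {peak_gb <= system_ram_gb}")  # False
```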
```shell
--dataset_base_path dataset \
--dataset_metadata_path dataset/squat_edit.jsonl \
--data_file_keys input,output \
--extra_inputs edit \
--train_batch_size 1 \
--gradient_accumulation_steps 8 \
--max_pixels 1048576 \
--dataset_repeat 2 \
--model_paths '[
  "DiffSynth-Studio/models/Qwen/Qwen-Image-Edit-2509/transformer/diffusion_pytorch_model-00001-of-00005.safetensors",
  ...
]' \
--learning_rate 1e-4 \
--num_epochs 1 \
--lora_base_model dit \
--lora_target_modules to_q,to_k,to_v,to_out.0,to_add_out,img_mlp.net.2,img_mod.1,txt_mlp.net.2,txt_mod.1 \
--lora_rank 8 \
--use_gradient_checkpointing \
--dataset_num_workers 1 \
--find_unused_parameters
```
📉 Behavior observed
- As soon as the 4 processes are spawned, each starts loading the full model weights independently.
- CPU RAM rises from ~5 GB to 62 GB within seconds, then all processes are killed (exit code -9).
- No GPU OOM; no batch iteration is ever started.
- Inference with the same model works fine (only one process).
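For what it's worth, the lazy-loading direction mentioned in the title can be illustrated with the standard library: a read-only mmap of a checkpoint file is backed by the OS page cache, so N processes mapping the same shard share one set of physical pages instead of each materializing a private copy. A minimal sketch of the idea (illustration only, not DiffSynth-Studio code; the safetensors library exposes a similar lazy interface via safe_open):

```python
import mmap
import os
import tempfile

# Create a stand-in "shard" on disk (real shards are safetensors files).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(1 << 20))  # 1 MiB of fake weight bytes
    shard_path = f.name

def map_shard(path: str) -> mmap.mmap:
    """Map a shard read-only; the OS shares the backing pages across
    every process that maps the same file."""
    fd = os.open(path, os.O_RDONLY)
    mm = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    os.close(fd)  # the mapping keeps the file contents accessible
    return mm

# Two mappings (in a real run: two training ranks) read the same bytes
# without allocating two private full-size copies up front.
rank0 = map_shard(shard_path)
rank1 = map_shard(shard_path)
same_bytes = rank0[:64] == rank1[:64]
print("ranks see identical bytes:", same_bytes)

rank0.close()
rank1.close()
os.unlink(shard_path)
```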
@KAI4816
- First, Qwen-Image LoRA training requires about 80 GB of GPU memory, which exceeds the capacity of your GPUs, so training cannot proceed on this hardware.
- Second, we will fix the CPU-loading issue and plan to release a major update within one month.
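For anyone who needs a stopgap before that update lands, the shared-memory direction named in the title can be sketched with Python's standard library. This is an illustration of the idea only, not the project's planned implementation: one process stages the weights into a named shared segment once, and the other ranks on the node attach to it instead of re-reading the checkpoint.

```python
from multiprocessing import shared_memory

# Stand-in for one loaded shard's raw bytes (a real loader would read the
# safetensors file here, on rank 0 only).
weights = b"\x01\x02\x03\x04" * 256

# Rank 0: stage the weights into a named shared-memory segment once.
shm = shared_memory.SharedMemory(create=True, size=len(weights))
shm.buf[: len(weights)] = weights

# Any other rank on the same node: attach by name (the name would be
# agreed on out of band, e.g. derived from the shard path) and read the
# same physical memory; no second copy is allocated.
peer = shared_memory.SharedMemory(name=shm.name)
attached_ok = bytes(peer.buf[: len(weights)]) == weights
print("peer sees staged weights:", attached_ok)

peer.close()
shm.close()
shm.unlink()
```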