
Will qwen_image_edit Support Batch Inference and DPO Training?

Open huxian0402 opened this issue 3 months ago • 4 comments

I tried to implement DPO training using LoRA with qwen_image_edit, but the current code does not seem to support batch inference and also has high GPU memory usage, which makes DPO training quite difficult. Is there an official plan to add support for DPO training in qwen_image_edit in the future? @pengqun @Lupino @calmhawk @co63oc @wenmengzhou Thank you.

huxian0402 avatar Sep 12 '25 11:09 huxian0402

@huxian0402

We currently do not plan to support batch sizes greater than 1, for the following reasons:

  • Larger batch sizes do not yield significant acceleration. Acceleration techniques such as Flash Attention have already maximized GPU utilization; thus, increasing the batch size only leads to higher VRAM consumption without delivering noticeable speed improvements.

  • Supporting batch size > 1 requires full-stack compatibility, which exceeds our team's current manpower. Enabling batch size > 1 for Qwen-Image LoRA training alone is straightforward, but once components such as ControlNet, various Adapters, or heterogeneous inputs are involved, a single component that lacks batch size > 1 support breaks the entire training pipeline:

    • For instance, Qwen-Image's text embeddings vary with input text length. Although padding and masking can align embeddings of different lengths (see the sketch after this list), this introduces substantial maintenance overhead.
    • Similarly, after enabling image editing in Qwen-Image, input images may have varying resolutions, making it impossible to concatenate them into a single tensor.
  • Future models will grow increasingly large, to the point where even training with batch size = 1 on a single GPU becomes challenging, so the engineering effort required to support larger batch sizes will quickly become obsolete. From FLUX to Qwen-Image, model parameters have grown from 12B to 20B, a clear trend toward ever-larger models. Therefore:

    • We prefer to focus our efforts on multi-GPU parallelism, including Tensor Parallelism and Sequence Parallelism, as well as pretraining-optimized frameworks like Megatron.
  • There are currently many equivalent alternatives to batch size > 1, including:

    • Multi-GPU or multi-machine training, where global batch size equals the number of GPUs.
    • Gradient Accumulation:
      • Mathematically, gradient accumulation is equivalent to using a larger batch size; you can verify this by implementing a simple neural-network framework similar to PyTorch. The only critical adjustment is to scale the learning rate by dividing it by the number of accumulation steps (for SGD), or equivalently to scale the loss by the same factor (a minimal sketch follows this list).
      • The main challenge with gradient accumulation today lies in numerical precision: accumulating gradients over multiple steps expands the dynamic range of gradient values. This issue is especially pronounced with bf16 precision, since current training frameworks (e.g., Accelerate) lack mechanisms to scale bf16 gradients during accumulation, leading to precision degradation (float32 performs significantly better). However, this limitation can be mitigated through deeper optimizations in the underlying training framework, such as accumulating into float32 buffers (see the second sketch after this list).
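
For reference, here is a minimal sketch of the padding-and-masking approach mentioned above for variable-length text embeddings. The function name and shapes are illustrative only, not DiffSynth-Studio's actual tensors or API:

```python
import torch

# Illustrative only: pad variable-length text embeddings to a common length
# and build a boolean mask so attention can ignore the padded positions.
# Names and shapes are hypothetical, not DiffSynth-Studio tensors.
def pad_and_mask(embeddings):
    # embeddings: list of tensors, each of shape (seq_len_i, dim)
    dim = embeddings[0].shape[-1]
    max_len = max(e.shape[0] for e in embeddings)
    batch = torch.zeros(len(embeddings), max_len, dim)
    mask = torch.zeros(len(embeddings), max_len, dtype=torch.bool)
    for i, e in enumerate(embeddings):
        batch[i, : e.shape[0]] = e
        mask[i, : e.shape[0]] = True       # True marks real (non-padded) tokens
    return batch, mask

embs = [torch.randn(5, 64), torch.randn(9, 64)]
batch, mask = pad_and_mask(embs)           # batch: (2, 9, 64), mask: (2, 9)
```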
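
A minimal sketch of the gradient-accumulation equivalence described above, assuming plain SGD and synthetic data; scaling each micro-batch loss by 1/accum_steps plays the same role as dividing the learning rate:

```python
import torch

# Sketch (assumptions: plain SGD, synthetic data, no momentum).
# Scaling each micro-batch loss by 1/accum_steps makes the accumulated
# gradient match the gradient of one batch that is accum_steps times larger.
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
accum_steps = 4

# Synthetic micro-batches of size 8.
loader = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()        # scale the loss instead of the lr
    if (step + 1) % accum_steps == 0:
        optimizer.step()                   # one update per accum_steps micro-batches
        optimizer.zero_grad()
```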
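
And a sketch of the float32-accumulation idea for the bf16 precision issue. This is a manual workaround written for illustration, not a built-in feature of Accelerate: bf16 gradients are summed into float32 buffers so that repeated additions across accumulation steps do not lose precision, with a single cast back before the optimizer step.

```python
import torch

# Illustrative: accumulate bf16 gradients into float32 buffers by hand.
model = torch.nn.Linear(16, 1).to(torch.bfloat16)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
accum_steps = 4
fp32_grads = [torch.zeros_like(p, dtype=torch.float32) for p in model.parameters()]

loader = [(torch.randn(8, 16, dtype=torch.bfloat16),
           torch.randn(8, 1, dtype=torch.bfloat16)) for _ in range(8)]

for step, (x, y) in enumerate(loader):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()
    for buf, p in zip(fp32_grads, model.parameters()):
        buf += p.grad.float()              # accumulate in float32
        p.grad = None
    if (step + 1) % accum_steps == 0:
        for buf, p in zip(fp32_grads, model.parameters()):
            p.grad = buf.to(p.dtype)       # single cast back for the update
            buf.zero_()
        optimizer.step()
        optimizer.zero_grad()
```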

Artiprocher avatar Sep 23 '25 05:09 Artiprocher

@huxian0402 DPO training is still under development and is expected to be completed within approximately one month.

Artiprocher avatar Sep 23 '25 05:09 Artiprocher

@Artiprocher Any updates on DPO?

nom avatar Oct 26 '25 03:10 nom

+1

dongdk avatar Nov 17 '25 02:11 dongdk

@Artiprocher Do you have a timeline for releasing the DPO training code? Thank you for your reply.

xyxxmb avatar Dec 08 '25 11:12 xyxxmb