ai-toolkit HiDream LoRA inference not working as expected

I trained a LoRA on HiDream with ai-toolkit. The validation samples generated during training looked good, but the images I get at inference time do not.

I am using the following code:

import os
import yaml
import argparse
import torch
from optimum.quanto import freeze, qfloat8, quantize
from transformers import PreTrainedTokenizerFast, LlamaForCausalLM
from diffusers import UniPCMultistepScheduler, HiDreamImagePipeline


def main(prompt_file: str, lora_path: str, char_prompt: str, output_dir: str):
    os.makedirs(output_dir, exist_ok=True)

    # Load prompts
    with open(prompt_file, "r", encoding="utf-8") as f:
        prompts = [line.strip() for line in f if line.strip()]
    
    tokenizer_4 = PreTrainedTokenizerFast.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
    text_encoder_4 = LlamaForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B-Instruct",
        output_hidden_states=True,
        output_attentions=True,
        torch_dtype=torch.float16,  # loaded as fp16; pipe.to("cuda", torch.bfloat16) below recasts it
    )


    pipe = HiDreamImagePipeline.from_pretrained(
        "HiDream-ai/HiDream-I1-Full",
        tokenizer_4=tokenizer_4,
        text_encoder_4=text_encoder_4,
        torch_dtype=torch.float16,
    ).to("cuda", torch.bfloat16)

    pipe.load_lora_weights(lora_path)
    pipe.fuse_lora(lora_scale=1)
    pipe.to("cuda")

    quantize(pipe.transformer, weights=qfloat8)
    freeze(pipe.transformer)

    # Generate and save images
    for idx, prompt in enumerate(prompts):
        prompt = f"{char_prompt} {prompt}"
        print(f"PROMPT:{prompt}")
        image = pipe(
            prompt,
            height=1024, 
            width=1024,
            guidance_scale=4.5,
            num_inference_steps=50,
        ).images[0]
        image_path = os.path.join(output_dir, f"image_{idx+1}.png")
        image.save(image_path)
        print(f"Saved: {image_path}")

Validation outputs

Inference outputs

Apr 28 '25 12:04 omrastogi

It seems that load_lora_weights is not available for HiDream in the diffusers implementation. Is there another way to run inference with this LoRA?
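
Whether the installed diffusers build exposes load_lora_weights on the HiDream pipeline can be checked directly. The sketch below is a generic attribute check, not something taken from the diffusers documentation; `supports_lora_loading` and `check_hidream_lora_support` are hypothetical helper names, and the result depends on which diffusers version is installed.

```python
def supports_lora_loading(pipeline_cls) -> bool:
    """Return True if the class (or instance) exposes a callable load_lora_weights."""
    return callable(getattr(pipeline_cls, "load_lora_weights", None))


def check_hidream_lora_support():
    """Check the installed diffusers build; returns None if HiDream cannot be imported."""
    try:
        from diffusers import HiDreamImagePipeline
    except ImportError:
        return None
    return supports_lora_loading(HiDreamImagePipeline)
```

Calling `check_hidream_lora_support()` in the same environment as the inference script tells you whether the method exists there; it says nothing about whether the loader understands the key layout of a given LoRA file.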

Apr 28 '25 13:04 omrastogi

@omrastogi How did you train the LoRA for HiDream?

Apr 30 '25 17:04 joeyism

Hi @joeyism

Here is the configuration I used to train the LoRA.

job: extension
config:
  name: VoxStyle_Hidream
  process:
  - type: sd_trainer
    training_folder: output/
    performance_log_every: 1000
    device: cuda:0
    network:
      type: lora
      linear: 64
      linear_alpha: 64
    save:
      dtype: bfloat16
      save_every: 1000
      max_step_saves_to_keep: 1
      push_to_hub: false
    datasets:
    - folder_path: /mnt/data/om/lora_dataset/VoxMachina
      caption_ext: txt
      caption_dropout_rate: 0.0
      shuffle_tokens: false
      cache_latents_to_disk: true
      resolution:
      - 512
      - 768
      - 1024
    train:
      batch_size: 1
      steps: 5000
      gradient_accumulation_steps: 1
      train_unet: true
      train_text_encoder: false
      gradient_checkpointing: true
      noise_scheduler: flowmatch
      timestep_type: shift
      optimizer: adamw8bit
      lr: 1e-5
      ema_config:
        use_ema: true
        ema_decay: 0.99
      dtype: bf16
    model:
      name_or_path: HiDream-ai/HiDream-I1-Full
      extras_name_or_path: "HiDream-ai/HiDream-I1-Full"
      arch: "hidream"
      quantize: true
      quantize_te: true
      model_kwargs:
        llama_model_path: "unsloth/Meta-Llama-3.1-8B-Instruct"
    sample:
      sampler: flowmatch
      sample_every: 100
      width: 1024
      height: 1024
      prompts:
      - In voxStyle, a western-anime fusion with cel-shaded, dull lighting, expressive characters, detailed 2D backgrounds and mature fantasy tone, A young man with short light brown hair and a serious expression. He is wearing a dark coat with a white shirt and a gray tie. The background is dark with green and gold swirls. The lighting is soft and diffused, creating a gentle glow on his face. The man is centered in the image, with the background slightly out of focus.
      - In voxStyle, a western-anime fusion with cel-shaded, dull lighting, expressive characters, detailed 2D backgrounds and mature fantasy tone, An elf woman with platinum blonde hair, pointed ears, and a white and gold off-shoulder dress is seated at a formal dining table. She is turned slightly to her right, covering her mouth with one hand as if whispering or reacting discreetly. The table is set with formal cutlery and a folded napkin on a plate. 
      - In voxStyle, a western-anime fusion with cel-shaded, dull lighting, expressive characters, detailed 2D backgrounds and mature fantasy tone, A woman with short black hair and a white fur tail, wearing a blue and black outfit with a brown belt. She has pointed ears and a confident expression. She is standing in a dimly lit room with a dark curtain in the background. The lighting is soft and warm, casting gentle shadows. The woman has a slender physique and is gesturing with her right hand, as if making a gesture.
      - In voxStyle, a western-anime fusion with cel-shaded, dull lighting, expressive characters, detailed 2D backgrounds and mature fantasy tone, Two young women peeking out from behind a dark curtain. The woman on the left has light skin, brown hair, and a white bandage wrapped around her head. She has a small smile and is looking directly at the viewer. The other woman has dark skin and brown hair. Both women have large, expressive eyes. The background is simple and dark, with vertical black stripes. The lighting is soft and diffused, casting gentle shadows.
      neg: ''
      seed: 42
      walk_seed: true
      guidance_scale: 4
      sample_steps: 25
meta:
  name: '[name]'
  version: '1.0'
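
The `sample` block in this config and the pipe(...) call in the inference script do not use the same sampling settings, which makes the two sets of images hard to compare directly. The sketch below only diffs values copied from the two listings; the dictionary and function names are illustrative, and aligning the settings is a debugging step, not a confirmed fix.

```python
# Values copied from the `sample` block of the training config
# (seed: 42 is used together with walk_seed: true there).
training_sample = {"guidance_scale": 4, "steps": 25, "width": 1024, "height": 1024, "seed": 42}

# Values copied from the pipe(...) call in the inference script (no seed is set there).
inference_call = {"guidance_scale": 4.5, "steps": 50, "width": 1024, "height": 1024, "seed": None}


def diff_settings(a: dict, b: dict) -> dict:
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    return {k: (a.get(k), b.get(k)) for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}


for key, (train_value, infer_value) in diff_settings(training_sample, inference_call).items():
    print(f"{key}: training sample={train_value}, inference={infer_value}")
```

Running inference once with the training-time values (same guidance scale, step count, and a fixed seed) gives a fairer comparison before looking for problems in the LoRA loading itself.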

Apr 30 '25 17:04 omrastogi

@joeyism, any ideas on how to run inference with the LoRA weights?
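
One way to start debugging this is to look at how the parameters in the saved LoRA file are named, since a loader can only apply keys it recognizes. The sketch below groups key names by their leading components. `summarize_key_prefixes` and `read_lora_keys` are hypothetical helpers, the example key names are made up for illustration and are not taken from an ai-toolkit checkpoint, and `read_lora_keys` assumes the safetensors package is installed.

```python
from collections import Counter


def summarize_key_prefixes(keys, depth: int = 1) -> Counter:
    """Count keys by their first `depth` dot-separated components."""
    return Counter(".".join(k.split(".")[:depth]) for k in keys)


def read_lora_keys(path: str) -> list:
    """List the tensor names stored in a .safetensors file."""
    # Imported lazily so the helper above works without safetensors installed.
    from safetensors import safe_open

    with safe_open(path, framework="pt") as f:
        return list(f.keys())


# Made-up key names, only to show the shape of the summary.
example_keys = [
    "transformer.blocks.0.attn.to_q.lora_A.weight",
    "transformer.blocks.0.attn.to_q.lora_B.weight",
    "other_prefix.blocks.0.attn.to_q.lora_A.weight",
]
print(summarize_key_prefixes(example_keys))
```

On a real file, `summarize_key_prefixes(read_lora_keys(lora_path))` shows which top-level prefixes the trainer wrote, which can then be compared with what the inference pipeline expects.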

Apr 30 '25 17:04 omrastogi