[Sharing Experience] Training Z-Image LoRA using 12G VRAM ~ 😁
(1) Dataset preparation: image material with a maximum side length of 768.
🎉 To minimize VRAM usage: after extensive testing, training succeeded with 6~10 images and still produced good LoRA results.
A maximum image side resolution of 1024 may exceed 12 GB of VRAM. You can try it!
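If you want to pre-scale your images to that 768 max-side rule, the target size is easy to compute. A minimal sketch (the helper name is my own, not part of ai-toolkit):

```python
def fit_max_side(width: int, height: int, max_side: int = 768) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side.

    Images already within the limit are returned unchanged; the aspect
    ratio is preserved (rounded to the nearest pixel).
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return round(width * scale), round(height * scale)

# Example: a 1920x1080 photo becomes 768x432 under the 768 max-side rule.
print(fit_max_side(1920, 1080))  # (768, 432)
```

This only computes the target size; pairing it with Pillow's `Image.resize` would do the actual resizing.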
(2) New Job Creation: Select Z-Image Turbo, and then set your model path. Follow the settings in the screenshot below.
You need to enter your own trigger words! This is just an example!
⚠️ Remember to set Transformer Offload to 0%; we won't use it because it throws an error, and we're not sure whether this is a bug.
Correction: Learning Rate 0.0001 ~ 0.0002 !
👉 After extensive testing, 2000 steps is a suitable value.
Datasets: Cache Latents and Resolution 512, 768 ~
👆 As you can see, training started successfully on 12G VRAM, and the speed is quite good! 👇
first_lora_v1: 26%|##5 | 519/2000 [24:43<1:10:54, 2.87s/it, lr: 2.0e-04 loss: 3.811e-01]
Training speed is approximately 2~3 seconds/it; 2000 steps take about 1~2 hours to complete.
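That estimate checks out with simple arithmetic, using the values from the progress bar above:

```python
steps = 2000
seconds_per_it = 2.87  # from the log line: 2.87s/it

total_minutes = steps * seconds_per_it / 60
print(f"{total_minutes:.0f} minutes (~{total_minutes / 60:.1f} hours)")
# 96 minutes (~1.6 hours)
```

At the slower end of 3 s/it you land at 100 minutes, so "about 1~2 hours" is a safe estimate.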
Finally: Wishing users with low VRAM success in training their own z-image LoRA!
Thanks to ai-toolkit and z-image, have fun! If you have better training settings, please share! 🤗
This is my second z-image LoRA. ai-toolkit\output
Now you can use it in ComfyUI via the LoRA loader! Z-Image-Turbo ~
thank you, very useful.
How to train the checkpoints model?
@bank010 I trained the LoRA using Z-Image-Turbo.
5070TI 16G first try
@yamasoo The 5070TI 16G can train at 1024 resolution.
@juntaosun Thank you for sharing your information. This was very useful. After some testing I got 1024 resolution running on my RTX 3060 with 12 GB VRAM. I used the standard settings for Z-Image Turbo (like Transformer = float8 and Resolutions = 512, 768, 1024), except for the following changes:
- Optimizer: Adafactor
- Learning Rate = 0.0003
- Steps = 1200
- Cache Text Embeddings
- Cache Latents
- Sample Resolution: Width = 768, Height = 1024 (with 1024 x 1024 it hangs at the first sample)
It runs hard at the limit, at about 11.5 - 11.7 GB VRAM usage. Speed was approx. 6 - 10 s/iteration. With Adafactor and a learning rate of 0.0003, approx. 1100 steps seems to be optimal. I'm not really sure, but Cache Text Embeddings and Cache Latents seem to reduce VRAM usage considerably. Hope this helps some others too.
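For reference, these changes would roughly correspond to an ai-toolkit YAML job config like the sketch below. Treat the key names as assumptions based on typical ai-toolkit LoRA configs, not as a verified Z-Image Turbo recipe; compare against the config the UI actually generates for your job:

```yaml
# Hedged sketch only -- key names follow common ai-toolkit LoRA configs
# and may differ for the Z-Image Turbo job type.
config:
  name: "zimage_lora_12gb"              # hypothetical job name
  process:
    - type: "sd_trainer"
      train:
        steps: 1200
        lr: 0.0003
        optimizer: "adafactor"
        cache_text_embeddings: true
      datasets:
        - folder_path: "/path/to/images"  # your dataset folder
          cache_latents_to_disk: true
          resolution: [512, 768, 1024]
      sample:
        width: 768
        height: 1024   # 1024x1024 hung at the first sample on 12 GB
```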
@cmyknao Thank you for sharing your information.
Thank you very much for these settings. I was trying to use another set that claimed to work on 12GB VRAM but kept getting an OOM near the start. It hasn't finished yet but the training is progressing, which is further than before. I suspect it may have been the cache latents or layer offloading that was the issue, as they were both off in my earlier attempts. Fingers crossed for this attempt.
@wideload1971 After starting, it uses more than 12 GB of VRAM for a short time. (Unfortunately, I can't remember at which step). But after that, everything runs smoothly.
Thank you, works very well on my rig,
Ryzen 9 3900 + RTX 4070 Super
ty this is awesome :333
There is no z-image in the model architecture.
sorry in advance for the "tangential" question. Do you have to add the relative_step parameter to use Adafactor?
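On the relative_step question: in the Hugging Face transformers implementation of Adafactor, relative_step defaults to True, and in that mode the optimizer rejects an explicit lr and instead uses a relative step size of roughly min(1e-2, 1/sqrt(t)). So to use a fixed learning rate like 0.0003 you would pass relative_step=False (usually together with scale_parameter=False); whether ai-toolkit's "adafactor" option already sets this for you is something I'd verify in its source. A tiny sketch of the relative step-size schedule from the Adafactor paper:

```python
def adafactor_relative_step(t: int, warmup_init: bool = False) -> float:
    """Relative step size used by Adafactor when relative_step=True.

    Follows the schedule from the Adafactor paper:
    rho_t = min(1e-2, 1/sqrt(t)), with an optional 1e-6*t warmup cap.
    """
    cap = 1e-6 * t if warmup_init else 1e-2
    return min(cap, 1.0 / t ** 0.5)

# The effective step is capped at 1e-2 and decays only past step 10,000:
print(adafactor_relative_step(1))          # 0.01
print(adafactor_relative_step(1_000_000))  # 0.001
```

This is why a fixed lr and relative_step=True conflict: with relative_step on, your lr setting would simply not be the step size used.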