[Sharing Experience] Training Z-Image LoRA using 12G VRAM ~ 😁
(1) Dataset preparation: image material with a maximum side length of 768.
🎉 To minimize VRAM usage: after extensive testing, training succeeded with 6~10 images and still produced good LoRA results.
A maximum image side resolution of 1024 may exceed 12 GB of VRAM. You can try it!
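If you want to pre-scale your images to that 768 max-side rule, the target size is easy to compute. A minimal sketch (the helper name is my own, not part of ai-toolkit):

```python
def fit_max_side(width: int, height: int, max_side: int = 768) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side.

    Images already within the limit are returned unchanged; the aspect
    ratio is preserved (rounded to the nearest pixel).
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return round(width * scale), round(height * scale)

# Example: a 1920x1080 photo becomes 768x432 under the 768 max-side rule.
print(fit_max_side(1920, 1080))  # (768, 432)
```

This only computes the target size; pairing it with Pillow's `Image.resize` would do the actual resizing.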
(2) New Job Creation: Select Z-Image Turbo, and then set your model path. Follow the settings in the screenshot below.
You need to enter your own trigger words! This is just an example!
⚠️ Remember to set Transformer Offload to 0%; we won't use it because it throws an error, and we're not sure whether this is a bug.
Correction: Learning Rate 0.0001 ~ 0.0002 !
👉 After extensive testing, 2000 steps is a suitable value.
Datasets: Cache Latents and Resolution 512, 768 ~
👆 As you can see, training started successfully on 12G VRAM, and the speed is quite good! 👇
first_lora_v1: 26%|##5 | 519/2000 [24:43<1:10:54, 2.87s/it, lr: 2.0e-04 loss: 3.811e-01]
Training speed is approximately 2~3 seconds/it; 2000 steps take about 1~2 hours to complete.
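That estimate checks out with simple arithmetic, using the values from the progress bar above:

```python
steps = 2000
seconds_per_it = 2.87  # from the log line: 2.87s/it

total_minutes = steps * seconds_per_it / 60
print(f"{total_minutes:.0f} minutes (~{total_minutes / 60:.1f} hours)")
# 96 minutes (~1.6 hours)
```

At the slower end of 3 s/it you land at 100 minutes, so "about 1~2 hours" is a safe estimate.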
Finally: Wishing users with low VRAM success in training their own z-image LoRA!
Thanks to ai-toolkit and z-image, have fun! If you have better training settings, please share! 🤗
This is my second z-image LoRA. ai-toolkit\output
Now you can use it in ComfyUI via the LoRA loader! Z-Image-Turbo ~
thank you, very useful.
How to train the checkpoints model?
@bank010 I trained the LoRA using Z-Image-Turbo.
5070TI 16G first try
@yamasoo The 5070TI 16G can train at 1024 resolution.
@juntaosun Thank you for sharing your information. This was very useful. After some testing I got 1024 resolution running on my RTX 3060 with 12 GB VRAM. I used the standard settings for Z-Image Turbo (like Transformer = float8 and Resolutions = 512, 768, 1024), except for the following changes:
- Optimizer: Adafactor
- Learning Rate = 0.0003
- Steps = 1200
- Cache Text Embeddings
- Cache Latents
- Sample Resolution: Width = 768, Height = 1024 (with 1024 x 1024 it hangs at the first sample)
It runs hard at the limit, at about 11.5 - 11.7 GB VRAM usage. Speed was approx. 6 - 10 s/iteration. With Adafactor and a learning rate of 0.0003, approx. 1100 steps seems to be optimal. I'm not really sure, but Cache Text Embeddings and Cache Latents seem to reduce VRAM usage considerably. Hope this helps some others too.
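For reference, these changes would roughly correspond to an ai-toolkit YAML job config like the sketch below. Treat the key names as assumptions based on typical ai-toolkit LoRA configs, not as a verified Z-Image Turbo recipe; compare against the config the UI actually generates for your job:

```yaml
# Hedged sketch only -- key names follow common ai-toolkit LoRA configs
# and may differ for the Z-Image Turbo job type.
config:
  name: "zimage_lora_12gb"              # hypothetical job name
  process:
    - type: "sd_trainer"
      train:
        steps: 1200
        lr: 0.0003
        optimizer: "adafactor"
        cache_text_embeddings: true
      datasets:
        - folder_path: "/path/to/images"  # your dataset folder
          cache_latents_to_disk: true
          resolution: [512, 768, 1024]
      sample:
        width: 768
        height: 1024   # 1024x1024 hung at the first sample on 12 GB
```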
@cmyknao Thank you for sharing your information.
Thank you very much for these settings. I was trying to use another set that claimed to work on 12GB VRAM but kept getting an OOM near the start. It hasn't finished yet but the training is progressing, which is further than before. I suspect it may have been the cache latents or layer offloading that was the issue, as they were both off in my earlier attempts. Fingers crossed for this attempt.
@wideload1971 After starting, it uses more than 12 GB of VRAM for a short time. (Unfortunately, I can't remember at which step). But after that, everything runs smoothly.
Thank you, works very well on my rig,
Ryzen 9 3900 + RTX 4070 Super
ty this is awesome :333
There is no z-image in the model architecture.
sorry in advance for the "tangential" question. Do you have to add the relative_step parameter to use Adafactor?
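On the relative_step question: in the Hugging Face transformers implementation of Adafactor, relative_step defaults to True, and in that mode the optimizer rejects an explicit lr and instead uses a relative step size of roughly min(1e-2, 1/sqrt(t)). So to use a fixed learning rate like 0.0003 you would pass relative_step=False (usually together with scale_parameter=False); whether ai-toolkit's "adafactor" option already sets this for you is something I'd verify in its source. A tiny sketch of the relative step-size schedule from the Adafactor paper:

```python
def adafactor_relative_step(t: int, warmup_init: bool = False) -> float:
    """Relative step size used by Adafactor when relative_step=True.

    Follows the schedule from the Adafactor paper:
    rho_t = min(1e-2, 1/sqrt(t)), with an optional 1e-6*t warmup cap.
    """
    cap = 1e-6 * t if warmup_init else 1e-2
    return min(cap, 1.0 / t ** 0.5)

# The effective step is capped at 1e-2 and decays only past step 10,000:
print(adafactor_relative_step(1))          # 0.01
print(adafactor_relative_step(1_000_000))  # 0.001
```

This is why a fixed lr and relative_step=True conflict: with relative_step on, your lr setting would simply not be the step size used.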