glid-3-xl
How to finetune the latent diffusion model
Hello @Jack000,
I found that you finetuned the "jack" model used in the latent diffusion notebook here (https://www.kaggle.com/code/litevex/lite-s-latent-diffusion-v9-with-gradio). Could you please give some guidance on how to improve/finetune the jack model on a wider dataset such as VQGAN Pairs or others, since the quality is not good for some styles and prompts? Also, please share the expected training time (in hours) and the GPU specification if possible.
Thanks, Shan
Hi @alishan2040, the code to finetune is in the README. It has pretty high VRAM requirements, so I only ran it on an A100. With one A100 and a batch size of 32 you can expect about 10k steps in 24 hours of training, which should already have a noticeable effect on the style of the output.
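For a rough sense of what that throughput means in practice, here is a quick back-of-the-envelope sketch; only the batch size and steps-per-day come from the reply above, the dataset size is a made-up placeholder:

```python
# Back-of-the-envelope training budget based on the numbers above:
# one A100, batch size 32, roughly 10k steps per 24h of training.
batch_size = 32
steps_per_day = 10_000
dataset_size = 500_000  # hypothetical number of image-text pairs

images_seen_per_day = batch_size * steps_per_day     # 320,000 samples
epochs_per_day = images_seen_per_day / dataset_size  # ~0.64 passes over the data

print(f"images seen per day: {images_seen_per_day:,}")
print(f"epochs per day:      {epochs_per_day:.2f}")
```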
Hi @limiteinductive, thanks for the reply and the details on finetuning the model. As @Jack000 mentioned in another thread, finetuning on a cleaner dataset gets rid of the watermark but loses some of the flat-illustration style of the base model. My questions are: are there any hints on preparing finetuning datasets so that new styles are added while existing styles are preserved as much as possible? And is it a good idea to include a percentage of the original LAION data to keep the original distribution, in addition to the new datasets?
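For concreteness, the kind of mixing I have in mind is sketched below: each training sample is drawn from the retained LAION subset with some probability and from the new style dataset otherwise. The dataset classes and the 20% ratio are just illustrative assumptions, not anything from the actual training code.

```python
import random
from torch.utils.data import Dataset

class MixedDataset(Dataset):
    """Mix a slice of the original LAION data back into finetuning.

    `laion_ds` and `style_ds` are assumed to be datasets yielding
    (image, caption) pairs -- hypothetical placeholders for illustration.
    """

    def __init__(self, laion_ds, style_ds, laion_ratio=0.2):
        self.laion_ds = laion_ds
        self.style_ds = style_ds
        self.laion_ratio = laion_ratio  # fraction of samples drawn from LAION

    def __len__(self):
        # Define one epoch by the size of the new style dataset.
        return len(self.style_ds)

    def __getitem__(self, idx):
        if random.random() < self.laion_ratio:
            # keep some of the original distribution
            return self.laion_ds[random.randrange(len(self.laion_ds))]
        return self.style_ds[idx]
```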
I'm running some tests myself, so I don't have a definite answer. My theory is that the style of the images generated by glid-3-xl is a fairly shallow feature: if you gather a dataset with a different visual style and train for only 5000 epochs, you will notice considerable differences in most generated samples, but if you prompt for something specific you will still be able to "dig out" images in another visual style.
@limiteinductive, @Jack000 mentioned that some preprocessing and cleaning was performed on the LAION-400M dataset before training the jack model. Could you please share those tips and/or any recommendations for fine-tuning and incorporating more datasets to improve quality?
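For example, is it roughly this kind of metadata filtering? The column names below follow the released LAION-400M parquet shards (TEXT, HEIGHT, WIDTH, NSFW, similarity), but the thresholds are arbitrary guesses on my part, not the ones used for the jack model:

```python
import pandas as pd

# Filter one LAION-400M metadata shard before downloading images.
# Thresholds are placeholders for illustration only.
df = pd.read_parquet("part-00000.parquet")

keep = (
    (df["HEIGHT"] >= 256) & (df["WIDTH"] >= 256)  # drop tiny images
    & (df["similarity"] >= 0.3)                   # CLIP image-text similarity
    & (df["NSFW"] == "UNLIKELY")                  # drop flagged samples
    & (df["TEXT"].str.len() >= 10)                # drop near-empty captions
)
df[keep].to_parquet("filtered-00000.parquet")
```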
Regarding your earlier reply: could you please clarify whether you mean 5000 epochs or 5000 steps? And what batch size and dataset size are you using? Thanks!