FLUX training for 8GB VRAM?
I've tried the 12GB, 16GB, and 20GB VRAM options here: https://github.com/kohya-ss/sd-scripts/tree/sd3?tab=readme-ov-file#flux1-lora-training and can confirm they all work.
But is it possible with 8GB VRAM? Is there a specific combination of configs we can try to get it working on 8GB?
Yes, you can. With split mode and an FP8 T5 model you can train a LoRA on 8 GB VRAM with 512x512 images.
@oleg996 thank you. Shouldn't we include the exact combination of flags in the README along with the other options? Could you share your launch flags if you have a working set?
Here are the launch flags (I am using the kohya_ss GUI to train):
/home/oleg/AI/LORA/kohya_ss/venv/bin/accelerate launch --dynamo_backend no --dynamo_mode default --mixed_precision bf16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2 /home/oleg/AI/LORA/kohya_ss/sd-scripts/flux_train_network.py --config_file /home/oleg/AI/LORA/dataset/model/config_lora-20240913-210606.toml
And the TOML file:
apply_t5_attn_mask = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true
caption_extension = ".txt"
clip_l = "/home/oleg/AI/ComfyUI/models/clip/clip_l.safetensors"
clip_skip = 1
discrete_flow_shift = 3.1582
dynamo_backend = "no"
enable_bucket = true
epoch = 40
fp8_base = true
full_bf16 = true
gradient_accumulation_steps = 1
gradient_checkpointing = true
guidance_scale = 1.0
huber_c = 0.1
huber_schedule = "snr"
logging_dir = "/home/oleg/AI/LORA/dataset/log"
loss_type = "l2"
lr_scheduler = "constant"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_train_steps = 800
min_bucket_reso = 256
min_snr_gamma = 7
mixed_precision = "bf16"
model_prediction_type = "raw"
network_alpha = 32
network_args = [ "train_blocks=single",]
network_dim = 32
network_module = "networks.lora_flux"
network_train_unet_only = true
noise_offset = 0.05
noise_offset_type = "Original"
optimizer_args = []
optimizer_type = "AdamW8bit"
output_dir = "/home/oleg/AI/LORA/dataset/model"
output_name = "Flux.my-super-duper-model"
pretrained_model_name_or_path = "/home/oleg/AI/ComfyUI/models/diffusion_models/flux1-dev-fp8.safetensors"
prior_loss_weight = 1
resolution = "512,512"
sample_prompts = "/home/oleg/AI/LORA/dataset/model/sample/prompt.txt"
sample_sampler = "euler"
save_every_n_epochs = 10
save_every_n_steps = 50
save_model_as = "safetensors"
save_precision = "bf16"
sdpa = true
split_mode = true
t5xxl = "/home/oleg/AI/LORA/kohya_ss/models/t5xxl_fp8_e4m3fn.safetensors"
t5xxl_max_token_length = 512
timestep_sampling = "shift"
train_batch_size = 1
train_data_dir = "/home/oleg/AI/LORA/dataset/img"
unet_lr = 0.0004
wandb_run_name = "Flux.my-super-duper-model"
I am still trying to figure out the best settings but my 4060 mobile is limiting me.
I am still trying to get it to work on my RTX 2060 Super. Currently I am facing an issue that seems specific to the RTX 20xx series, but at least I am not running out of memory on my 8 GB card at the moment. My setup is described here: https://github.com/bmaltais/kohya_ss/issues/2701#issuecomment-2352433098 It is partly derived from oleg's.
Waiting for the follow-up: can we train successfully?
What about using Adafactor instead of AdamW8bit?
--optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" --split_mode --network_args "train_blocks=single" --lr_scheduler constant_with_warmup --max_grad_norm 0.0
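For anyone running from the TOML config shared earlier rather than raw CLI flags, those options should map onto keys of the same names (a sketch; sd-scripts reads the same argument names from --config_file):
optimizer_type = "Adafactor"
optimizer_args = [ "relative_step=False", "scale_parameter=False", "warmup_init=False",]
lr_scheduler = "constant_with_warmup"
max_grad_norm = 0.0
split_mode = true
network_args = [ "train_blocks=single",]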
Trying a 64/64 dim/alpha fp8 LoRA. The VRAM usage is around 7.5 GB.
Also, does Adafactor need a higher LR?
When relative_step=False, in my experience it is good to start with a learning rate roughly the same as AdamW.
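So, as a sketch, you could start from the same LoRA learning rate as the AdamW8bit config above and tune from there:
unet_lr = 0.0004  # roughly the same starting LR as with AdamW8bit, since relative_step=False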
Do you have a complete JSON config file? Can you send one? I am testing Flux training with 8GB of video memory.
Unfortunately, I have not personally verified training with 8GB.
The JSON config file:
{
"LoRA_type": "Flux1",
"LyCORIS_preset": "full",
"adaptive_noise_scale": 0,
"additional_parameters": "",
"ae": "/home/oleg/AI/ComfyUI/models/vae/fluxdevvae.safetensors",
"apply_t5_attn_mask": true,
"async_upload": false,
"block_alphas": "",
"block_dims": "",
"block_lr_zero_threshold": "",
"bucket_no_upscale": true,
"bucket_reso_steps": 64,
"bypass_mode": false,
"cache_latents": true,
"cache_latents_to_disk": true,
"caption_dropout_every_n_epochs": 0,
"caption_dropout_rate": 0,
"caption_extension": ".txt",
"clip_l": "/home/oleg/AI/ComfyUI/models/clip/clip_l.safetensors",
"clip_skip": 1,
"color_aug": false,
"constrain": 0,
"conv_alpha": 1,
"conv_block_alphas": "",
"conv_block_dims": "",
"conv_dim": 1,
"cpu_offload_checkpointing": false,
"dataset_config": "",
"debiased_estimation_loss": false,
"decompose_both": false,
"dim_from_weights": false,
"discrete_flow_shift": 3.1582,
"dora_wd": false,
"down_lr_weight": "",
"dynamo_backend": "no",
"dynamo_mode": "default",
"dynamo_use_dynamic": false,
"dynamo_use_fullgraph": false,
"enable_all_linear": false,
"enable_bucket": true,
"epoch": 100,
"extra_accelerate_launch_args": "",
"factor": -1,
"flip_aug": false,
"flux1_cache_text_encoder_outputs": true,
"flux1_cache_text_encoder_outputs_to_disk": true,
"flux1_checkbox": true,
"fp8_base": true,
"fp8_base_unet": false,
"full_bf16": true,
"full_fp16": false,
"gpu_ids": "",
"gradient_accumulation_steps": 1,
"gradient_checkpointing": true,
"guidance_scale": 1,
"highvram": false,
"huber_c": 0.1,
"huber_schedule": "snr",
"huggingface_path_in_repo": "",
"huggingface_repo_id": "",
"huggingface_repo_type": "",
"huggingface_repo_visibility": "",
"huggingface_token": "",
"img_attn_dim": "",
"img_mlp_dim": "",
"img_mod_dim": "",
"in_dims": "",
"ip_noise_gamma": 0,
"ip_noise_gamma_random_strength": false,
"keep_tokens": 0,
"learning_rate": 0,
"log_config": false,
"log_tracker_config": "",
"log_tracker_name": "",
"log_with": "",
"logging_dir": "/home/oleg/AI/LORA/dataset/log",
"loraplus_lr_ratio": 0,
"loraplus_text_encoder_lr_ratio": 0,
"loraplus_unet_lr_ratio": 0,
"loss_type": "l2",
"lowvram": false,
"lr_scheduler": "constant_with_warmup",
"lr_scheduler_args": "",
"lr_scheduler_num_cycles": 1,
"lr_scheduler_power": 1,
"lr_scheduler_type": "",
"lr_warmup": 0,
"lr_warmup_steps": 0,
"main_process_port": 0,
"masked_loss": false,
"max_bucket_reso": 2048,
"max_data_loader_n_workers": 0,
"max_grad_norm": 0,
"max_resolution": "512,512",
"max_timestep": 1000,
"max_token_length": 225,
"max_train_epochs": 0,
"max_train_steps": 1000,
"mem_eff_attn": false,
"mem_eff_save": false,
"metadata_author": "",
"metadata_description": "",
"metadata_license": "",
"metadata_tags": "",
"metadata_title": "",
"mid_lr_weight": "",
"min_bucket_reso": 256,
"min_snr_gamma": 7,
"min_timestep": 0,
"mixed_precision": "bf16",
"model_list": "custom",
"model_prediction_type": "raw",
"module_dropout": 0,
"multi_gpu": false,
"multires_noise_discount": 0.3,
"multires_noise_iterations": 0,
"network_alpha": 64,
"network_dim": 64,
"network_dropout": 0,
"network_weights": "",
"noise_offset": 0.05,
"noise_offset_random_strength": false,
"noise_offset_type": "Original",
"num_cpu_threads_per_process": 2,
"num_machines": 1,
"num_processes": 1,
"optimizer": "Adafactor",
"optimizer_args": "\"relative_step=False\" \"scale_parameter=False\" \"warmup_init=False\"",
"output_dir": "/home/oleg/AI/LORA/dataset/model",
"output_name": "oleg_lora",
"persistent_data_loader_workers": false,
"pretrained_model_name_or_path": "/home/oleg/AI/ComfyUI/models/diffusion_models/flux1-dev-fp8.safetensors",
"prior_loss_weight": 1,
"random_crop": false,
"rank_dropout": 0,
"rank_dropout_scale": false,
"reg_data_dir": "",
"rescaled": false,
"resume": "",
"resume_from_huggingface": "",
"sample_every_n_epochs": 0,
"sample_every_n_steps": 0,
"sample_prompts": "saruman posing under a stormy lightning sky, photorealistic --w 832 --h 1216 --s 20 --l 4 --d 42",
"sample_sampler": "euler",
"save_every_n_epochs": 1,
"save_every_n_steps": 50,
"save_last_n_steps": 0,
"save_last_n_steps_state": 0,
"save_model_as": "safetensors",
"save_precision": "bf16",
"save_state": false,
"save_state_on_train_end": false,
"save_state_to_huggingface": false,
"scale_v_pred_loss_like_noise_pred": false,
"scale_weight_norms": 0,
"sdxl": false,
"sdxl_cache_text_encoder_outputs": true,
"sdxl_no_half_vae": true,
"seed": 0,
"shuffle_caption": false,
"single_dim": "",
"single_mod_dim": "",
"split_mode": true,
"split_qkv": false,
"stop_text_encoder_training": 0,
"t5xxl": "/home/oleg/AI/LORA/kohya_ss/models/t5xxl_fp8_e4m3fn.safetensors",
"t5xxl_lr": 0,
"t5xxl_max_token_length": 512,
"text_encoder_lr": 0.0004,
"timestep_sampling": "shift",
"train_batch_size": 1,
"train_blocks": "single",
"train_data_dir": "/home/oleg/AI/LORA/dataset/img",
"train_double_block_indices": "all",
"train_norm": false,
"train_on_input": true,
"train_single_block_indices": "all",
"train_t5xxl": false,
"training_comment": "",
"txt_attn_dim": "",
"txt_mlp_dim": "",
"txt_mod_dim": "",
"unet_lr": 0.0004,
"unit": 1,
"up_lr_weight": "",
"use_cp": false,
"use_scalar": false,
"use_tucker": false,
"v2": false,
"v_parameterization": false,
"v_pred_like_loss": 0,
"vae": "",
"vae_batch_size": 0,
"wandb_api_key": "",
"wandb_run_name": "",
"weighted_captions": false,
"xformers": "sdpa"
}
I trained a LoRA on 10 images of myself (selfies). The output is OK, but strangely smooth and blurry. Is something wrong with the config?
8GB is possible, but 2000-series NVIDIA cards don't support BF16, which is the precision Flux trains stably in. Training in FP16 causes the loss to NaN immediately. You can still set mixed precision to BF16 and keep the model in BF16 with an fp8 base. The downside of this approach is that, since the hardware doesn't support BF16 natively, values get converted to FP32 before the calculations, which is quite slow. It also loses the memory advantage, so even with all available improvements it gets very close to capping out compared to native BF16.
mixed_precision="bf16" fp8_base=true
accelerate launch --mixed_precision bf16 # Or --mixed_precision no
Plus the largest possible number of block swaps, gradient checkpointing, cached latents and text encoder outputs, and a batch size of 1.
If you have 8GB on a 3000-series card or higher it is probably more feasible, as usage is nearer to 6GB (don't quote me on that) when using BF16 (compared to FP16, which gives NaN loss).
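Put together, a sketch of those low-VRAM pieces as TOML keys (the blocks_to_swap key assumes the block-swap option from the sd3 branch; its value here is just a placeholder, not a verified 8GB recipe):
mixed_precision = "bf16"
fp8_base = true
gradient_checkpointing = true
train_batch_size = 1
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true
blocks_to_swap = 18  # "largest amount of block swaps": higher values move more blocks to RAM, saving VRAM at some speed cost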
I've been doing some further poking and got the VRAM down to 5.8GB by maximizing the block swaps (beyond what's configurable right now), and on my CPU it seems to have minimal impact on step speed. Still on an 8GB 2080 with mixed_precision bf16 (which effectively converts to fp32). If you have a 3000-series or higher NVIDIA card it can probably go lower than that, since those support bf16.