
FLUX training for 8GB VRAM?

Open cocktailpeanut opened this issue 1 year ago • 12 comments

I've tried the 12G, 16G, and 20G VRAM options here: https://github.com/kohya-ss/sd-scripts/tree/sd3?tab=readme-ov-file#flux1-lora-training and can confirm they all work.

But is it possible with 8GB VRAM? Is there a specific combination of configs we can try to make it work?

cocktailpeanut avatar Sep 12 '24 15:09 cocktailpeanut

Yes, you can. With split mode and an FP8 T5 model, you can train a LoRA on 8 GB of VRAM with 512x512 images.
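
For context, the load-bearing options are --split_mode together with --network_args "train_blocks=single" and an FP8 T5-XXL text encoder. A minimal sketch of just those settings in sd-scripts TOML form, with placeholder paths (a full working config follows below):

# Memory-critical settings only; paths are placeholders.
pretrained_model_name_or_path = "/path/to/flux1-dev-fp8.safetensors"
t5xxl = "/path/to/t5xxl_fp8_e4m3fn.safetensors"  # FP8 T5-XXL
fp8_base = true
split_mode = true
network_module = "networks.lora_flux"
network_args = [ "train_blocks=single",]  # required together with split_mode
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true
gradient_checkpointing = true
train_batch_size = 1
resolution = "512,512"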

oleg996 avatar Sep 13 '24 15:09 oleg996

@oleg996 thank you. Shouldn't we include the exact combination of flags in the README along with the other options? Could you share your working set of launch flags?

cocktailpeanut avatar Sep 13 '24 17:09 cocktailpeanut

Here are the launch flags (I am using the kohya-ss GUI to train):

/home/oleg/AI/LORA/kohya_ss/venv/bin/accelerate launch --dynamo_backend no --dynamo_mode default --mixed_precision bf16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2 /home/oleg/AI/LORA/kohya_ss/sd-scripts/flux_train_network.py --config_file /home/oleg/AI/LORA/dataset/model/config_lora-20240913-210606.toml

and the TOML file:

apply_t5_attn_mask = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true
caption_extension = ".txt"
clip_l = "/home/oleg/AI/ComfyUI/models/clip/clip_l.safetensors"
clip_skip = 1
discrete_flow_shift = 3.1582
dynamo_backend = "no"
enable_bucket = true
epoch = 40
fp8_base = true
full_bf16 = true
gradient_accumulation_steps = 1
gradient_checkpointing = true
guidance_scale = 1.0
huber_c = 0.1
huber_schedule = "snr"
logging_dir = "/home/oleg/AI/LORA/dataset/log"
loss_type = "l2"
lr_scheduler = "constant"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_train_steps = 800
min_bucket_reso = 256
min_snr_gamma = 7
mixed_precision = "bf16"
model_prediction_type = "raw"
network_alpha = 32
network_args = [ "train_blocks=single",]
network_dim = 32
network_module = "networks.lora_flux"
network_train_unet_only = true
noise_offset = 0.05
noise_offset_type = "Original"
optimizer_args = []
optimizer_type = "AdamW8bit"
output_dir = "/home/oleg/AI/LORA/dataset/model"
output_name = "Flux.my-super-duper-model"
pretrained_model_name_or_path = "/home/oleg/AI/ComfyUI/models/diffusion_models/flux1-dev-fp8.safetensors"
prior_loss_weight = 1
resolution = "512,512"
sample_prompts = "/home/oleg/AI/LORA/dataset/model/sample/prompt.txt"
sample_sampler = "euler"
save_every_n_epochs = 10
save_every_n_steps = 50
save_model_as = "safetensors"
save_precision = "bf16"
sdpa = true
split_mode = true
t5xxl = "/home/oleg/AI/LORA/kohya_ss/models/t5xxl_fp8_e4m3fn.safetensors"
t5xxl_max_token_length = 512
timestep_sampling = "shift"
train_batch_size = 1
train_data_dir = "/home/oleg/AI/LORA/dataset/img"
unet_lr = 0.0004
wandb_run_name = "Flux.my-super-duper-model"

I am still trying to figure out the best settings, but my 4060 mobile is limiting me.

oleg996 avatar Sep 13 '24 19:09 oleg996

I am still trying to get it to work on my RTX 2060 Super. Currently I am facing an issue apparently specific to the RTX 20xx series, but at least I am not running out of memory on my 8 GB card at the moment. My setup is described here: https://github.com/bmaltais/kohya_ss/issues/2701#issuecomment-2352433098 It is partly derived from oleg's.

maxanier avatar Sep 19 '24 13:09 maxanier

Waiting for the follow-up, can we successfully train?

wzgrx avatar Sep 21 '24 13:09 wzgrx

Waiting for the follow-up, can we successfully train?

What about using Adafactor instead of AdamW8bit?

--optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" --split_mode --network_args "train_blocks=single" --lr_scheduler constant_with_warmup --max_grad_norm 0.0
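
In TOML form, matching the config files above, that suggestion would look something like:

optimizer_type = "adafactor"
optimizer_args = [ "relative_step=False", "scale_parameter=False", "warmup_init=False",]
split_mode = true
network_args = [ "train_blocks=single",]
lr_scheduler = "constant_with_warmup"
max_grad_norm = 0.0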

kohya-ss avatar Sep 22 '24 13:09 kohya-ss

Waiting for the follow-up, can we successfully train?

What about using Adafactor instead of AdamW8bit?

--optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" --split_mode --network_args "train_blocks=single" --lr_scheduler constant_with_warmup --max_grad_norm 0.0

Trying a 64/64 dim/alpha FP8 LoRA. The VRAM usage is around 7.5 GB.

oleg996 avatar Sep 23 '24 10:09 oleg996

Waiting for the follow-up, can we successfully train?

What about using Adafactor instead of AdamW8bit?

--optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" --split_mode --network_args "train_blocks=single" --lr_scheduler constant_with_warmup --max_grad_norm 0.0

Also, does Adafactor need a higher LR?

oleg996 avatar Sep 23 '24 10:09 oleg996

Also, does Adafactor need a higher LR?

With relative_step=False, in my experience it is good to start with roughly the same learning rate as for AdamW.
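
As a sketch, reusing the unet_lr from the AdamW8bit config earlier in this thread (a starting point, not a recommendation):

optimizer_type = "adafactor"
optimizer_args = [ "relative_step=False", "scale_parameter=False", "warmup_init=False",]
unet_lr = 0.0004  # same starting LR as the AdamW8bit config above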

kohya-ss avatar Sep 23 '24 12:09 kohya-ss

Do you have a complete JSON config file? Could you send one? I am testing FLUX training with 8 GB of VRAM.

wzgrx avatar Sep 23 '24 14:09 wzgrx

Unfortunately, I have not personally verified training with 8GB.

kohya-ss avatar Sep 25 '24 10:09 kohya-ss

Do you have a complete JSON config file? Could you send one? I am testing FLUX training with 8 GB of VRAM.

The JSON file:

{
  "LoRA_type": "Flux1",
  "LyCORIS_preset": "full",
  "adaptive_noise_scale": 0,
  "additional_parameters": "",
  "ae": "/home/oleg/AI/ComfyUI/models/vae/fluxdevvae.safetensors",
  "apply_t5_attn_mask": true,
  "async_upload": false,
  "block_alphas": "",
  "block_dims": "",
  "block_lr_zero_threshold": "",
  "bucket_no_upscale": true,
  "bucket_reso_steps": 64,
  "bypass_mode": false,
  "cache_latents": true,
  "cache_latents_to_disk": true,
  "caption_dropout_every_n_epochs": 0,
  "caption_dropout_rate": 0,
  "caption_extension": ".txt",
  "clip_l": "/home/oleg/AI/ComfyUI/models/clip/clip_l.safetensors",
  "clip_skip": 1,
  "color_aug": false,
  "constrain": 0,
  "conv_alpha": 1,
  "conv_block_alphas": "",
  "conv_block_dims": "",
  "conv_dim": 1,
  "cpu_offload_checkpointing": false,
  "dataset_config": "",
  "debiased_estimation_loss": false,
  "decompose_both": false,
  "dim_from_weights": false,
  "discrete_flow_shift": 3.1582,
  "dora_wd": false,
  "down_lr_weight": "",
  "dynamo_backend": "no",
  "dynamo_mode": "default",
  "dynamo_use_dynamic": false,
  "dynamo_use_fullgraph": false,
  "enable_all_linear": false,
  "enable_bucket": true,
  "epoch": 100,
  "extra_accelerate_launch_args": "",
  "factor": -1,
  "flip_aug": false,
  "flux1_cache_text_encoder_outputs": true,
  "flux1_cache_text_encoder_outputs_to_disk": true,
  "flux1_checkbox": true,
  "fp8_base": true,
  "fp8_base_unet": false,
  "full_bf16": true,
  "full_fp16": false,
  "gpu_ids": "",
  "gradient_accumulation_steps": 1,
  "gradient_checkpointing": true,
  "guidance_scale": 1,
  "highvram": false,
  "huber_c": 0.1,
  "huber_schedule": "snr",
  "huggingface_path_in_repo": "",
  "huggingface_repo_id": "",
  "huggingface_repo_type": "",
  "huggingface_repo_visibility": "",
  "huggingface_token": "",
  "img_attn_dim": "",
  "img_mlp_dim": "",
  "img_mod_dim": "",
  "in_dims": "",
  "ip_noise_gamma": 0,
  "ip_noise_gamma_random_strength": false,
  "keep_tokens": 0,
  "learning_rate": 0,
  "log_config": false,
  "log_tracker_config": "",
  "log_tracker_name": "",
  "log_with": "",
  "logging_dir": "/home/oleg/AI/LORA/dataset/log",
  "loraplus_lr_ratio": 0,
  "loraplus_text_encoder_lr_ratio": 0,
  "loraplus_unet_lr_ratio": 0,
  "loss_type": "l2",
  "lowvram": false,
  "lr_scheduler": "constant_with_warmup",
  "lr_scheduler_args": "",
  "lr_scheduler_num_cycles": 1,
  "lr_scheduler_power": 1,
  "lr_scheduler_type": "",
  "lr_warmup": 0,
  "lr_warmup_steps": 0,
  "main_process_port": 0,
  "masked_loss": false,
  "max_bucket_reso": 2048,
  "max_data_loader_n_workers": 0,
  "max_grad_norm": 0,
  "max_resolution": "512,512",
  "max_timestep": 1000,
  "max_token_length": 225,
  "max_train_epochs": 0,
  "max_train_steps": 1000,
  "mem_eff_attn": false,
  "mem_eff_save": false,
  "metadata_author": "",
  "metadata_description": "",
  "metadata_license": "",
  "metadata_tags": "",
  "metadata_title": "",
  "mid_lr_weight": "",
  "min_bucket_reso": 256,
  "min_snr_gamma": 7,
  "min_timestep": 0,
  "mixed_precision": "bf16",
  "model_list": "custom",
  "model_prediction_type": "raw",
  "module_dropout": 0,
  "multi_gpu": false,
  "multires_noise_discount": 0.3,
  "multires_noise_iterations": 0,
  "network_alpha": 64,
  "network_dim": 64,
  "network_dropout": 0,
  "network_weights": "",
  "noise_offset": 0.05,
  "noise_offset_random_strength": false,
  "noise_offset_type": "Original",
  "num_cpu_threads_per_process": 2,
  "num_machines": 1,
  "num_processes": 1,
  "optimizer": "Adafactor",
  "optimizer_args": "\"relative_step=False\" \"scale_parameter=False\" \"warmup_init=False\"",
  "output_dir": "/home/oleg/AI/LORA/dataset/model",
  "output_name": "oleg_lora",
  "persistent_data_loader_workers": false,
  "pretrained_model_name_or_path": "/home/oleg/AI/ComfyUI/models/diffusion_models/flux1-dev-fp8.safetensors",
  "prior_loss_weight": 1,
  "random_crop": false,
  "rank_dropout": 0,
  "rank_dropout_scale": false,
  "reg_data_dir": "",
  "rescaled": false,
  "resume": "",
  "resume_from_huggingface": "",
  "sample_every_n_epochs": 0,
  "sample_every_n_steps": 0,
  "sample_prompts": "saruman posing under a stormy lightning sky, photorealistic --w 832 --h 1216 --s 20 --l 4 --d 42",
  "sample_sampler": "euler",
  "save_every_n_epochs": 1,
  "save_every_n_steps": 50,
  "save_last_n_steps": 0,
  "save_last_n_steps_state": 0,
  "save_model_as": "safetensors",
  "save_precision": "bf16",
  "save_state": false,
  "save_state_on_train_end": false,
  "save_state_to_huggingface": false,
  "scale_v_pred_loss_like_noise_pred": false,
  "scale_weight_norms": 0,
  "sdxl": false,
  "sdxl_cache_text_encoder_outputs": true,
  "sdxl_no_half_vae": true,
  "seed": 0,
  "shuffle_caption": false,
  "single_dim": "",
  "single_mod_dim": "",
  "split_mode": true,
  "split_qkv": false,
  "stop_text_encoder_training": 0,
  "t5xxl": "/home/oleg/AI/LORA/kohya_ss/models/t5xxl_fp8_e4m3fn.safetensors",
  "t5xxl_lr": 0,
  "t5xxl_max_token_length": 512,
  "text_encoder_lr": 0.0004,
  "timestep_sampling": "shift",
  "train_batch_size": 1,
  "train_blocks": "single",
  "train_data_dir": "/home/oleg/AI/LORA/dataset/img",
  "train_double_block_indices": "all",
  "train_norm": false,
  "train_on_input": true,
  "train_single_block_indices": "all",
  "train_t5xxl": false,
  "training_comment": "",
  "txt_attn_dim": "",
  "txt_mlp_dim": "",
  "txt_mod_dim": "",
  "unet_lr": 0.0004,
  "unit": 1,
  "up_lr_weight": "",
  "use_cp": false,
  "use_scalar": false,
  "use_tucker": false,
  "v2": false,
  "v_parameterization": false,
  "v_pred_like_loss": 0,
  "vae": "",
  "vae_batch_size": 0,
  "wandb_api_key": "",
  "wandb_run_name": "",
  "weighted_captions": false,
  "xformers": "sdpa"
}

I trained a LoRA on 10 images of myself (selfies). The output is OK but strangely smooth and blurry. Is something wrong with the config?

oleg996 avatar Sep 25 '24 13:09 oleg996

8GB is possible, but 2000 series NVIDIA cards don't support BF16, which is the precision FLUX trains stably in; training in FP16 causes the loss to NaN immediately. You can still set mixed precision to BF16 and keep the model at BF16 with an FP8 base. The downside is that, since the hardware doesn't support BF16, values are converted to FP32 before the calculations, which is quite slow. It also loses the memory advantage, so usage gets very close to the cap even at BF16 with all available improvements.

mixed_precision="bf16" fp8_base=true

accelerate launch --mixed_precision bf16 # Or --mixed_precision no

Use the largest number of block swaps, gradient checkpointing, and cached latents and text encoder outputs, with a batch size of 1.

If you have 8GB on a 3000 series or higher card it is probably more feasible, as usage is nearer to 6GB (don't quote me on that) when using BF16 (compared to FP16, which gives NaN loss).
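
Put together, a hedged sketch of the low-VRAM settings described here; blocks_to_swap is the block-swap option on the sd3 branch, and its usable maximum depends on your version:

mixed_precision = "bf16"  # or "no" on 2000 series; math runs in FP32 either way
fp8_base = true
blocks_to_swap = 35       # assumption: near the maximum; tune for your version
gradient_checkpointing = true
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true
train_batch_size = 1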

rockerBOO avatar Jan 29 '25 21:01 rockerBOO

I've been doing some further poking and got the VRAM down to 5.8GB by maximizing the block swaps (beyond what's configurable right now), and on my CPU it seems to have minimal impact on the step speed. Still on an 8GB 2080 with mixed_precision bf16 (which effectively converts to FP32). If you have a 3000 series or higher NVIDIA card it can probably go lower than that, since those support BF16.

rockerBOO avatar Jan 31 '25 05:01 rockerBOO