kohya_ss
Won't go over one epoch
I am trying to train an SDXL LoRA, and for some reason it will not train for more than one epoch. I need it to run for ten. It recognizes that I entered 10 epochs, but when training starts, it reports that it will run for only one epoch.
What is the train command?
Can you share the toml file?
I have the same problem and had not seen this post before. Here is the toml file from my case; as you can see, it is set to 10 epochs.
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
caption_extension = ".txt"
clip_skip = 1
dynamo_backend = "no"
enable_bucket = true
epoch = 10
gradient_accumulation_steps = 1
gradient_checkpointing = true
huber_c = 0.1
huber_schedule = "snr"
learning_rate = 0.0003
logging_dir = "C:/EnriqueModels/kohya_ss/satelite_images/data_training/log"
loss_type = "l2"
lr_scheduler = "constant"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_token_length = 75
max_train_steps = 1600
min_bucket_reso = 256
mixed_precision = "fp16"
multires_noise_discount = 0.3
network_alpha = 1
network_args = []
network_dim = 8
network_module = "networks.lora"
noise_offset_type = "Original"
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False",]
optimizer_type = "Adafactor"
output_dir = "C:/EnriqueModels/kohya_ss/satelite_images/data_training/model"
output_name = "last"
pretrained_model_name_or_path = "runwayml/stable-diffusion-v1-5"
prior_loss_weight = 1
resolution = "512,512"
sample_prompts = "C:/EnriqueModels/kohya_ss/satelite_images/data_training/model\\prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 1
save_model_as = "safetensors"
save_precision = "fp16"
text_encoder_lr = 0.0003
train_batch_size = 1
train_data_dir = "C:/EnriqueModels/kohya_ss/satelite_images/data_training/img"
unet_lr = 0.0003
xformers = true
Try changing max_train_steps from 1600 to 0
Hmm, I checked the code, and if it is set to 0 it will be reset to 1600 by sd-scripts. Try setting it to 16000000 so your training never reaches it. I will see if there is a way to make sd-scripts disregard max train steps by forcing it to 0 and passing that through to sd-scripts.
It's weird, because it looks like the number of epochs is calculated automatically: when I set it to 2 with the same number of images, the model was trained for 7 epochs.
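One way to read the behaviour reported in this thread is that the epoch count shown at startup is derived from the step cap rather than taken from the epoch field. The sketch below illustrates that arithmetic only. It is an assumption about how the trainer computes its numbers, not a copy of sd-scripts code: the function name `planned_epochs`, the dataset size, and the repeat count are hypothetical, and bucketing is ignored.

```python
import math

def planned_epochs(num_images, repeats, batch_size, grad_accum,
                   max_train_steps, max_train_epochs=None):
    """Return (total_steps, epochs) under the assumed arithmetic."""
    # Batches seen in one pass over the (repeated) dataset.
    batches_per_epoch = math.ceil(num_images * repeats / batch_size)
    # Optimizer steps per epoch after gradient accumulation.
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    # Assumption: an explicit epoch count replaces the step cap.
    if max_train_epochs is not None:
        max_train_steps = max_train_epochs * steps_per_epoch
    # Otherwise the epoch count falls out of the step cap.
    epochs = math.ceil(max_train_steps / steps_per_epoch)
    return max_train_steps, epochs

# Hypothetical dataset: 40 images x 40 repeats, batch size 1.
print(planned_epochs(40, 40, 1, 1, max_train_steps=1600))        # (1600, 1)
print(planned_epochs(40, 40, 1, 1, 1600, max_train_epochs=10))   # (16000, 10)
```

Under these assumptions, a step cap of 1600 on a dataset that already yields 1600 steps per epoch gives exactly one epoch, whatever the epoch field says.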
There might be a conflict, since you specify both the number of epochs and the number of steps. Both control how long training runs.
Edit: when I checked, --max_train_epochs overrides --max_train_steps, so it should be the opposite of your problem, though.
Try printing the command, since that's what accelerate is going to run.
accelerate launch --num_cpu_threads_per_process=2 "C:\Train\kohya/sd-scripts/train_db.py" --lr_scheduler_type "torch.optim.lr_scheduler.MultiStepLR" --bucket_reso_steps=64 --cache_latents --cache_latents_to_disk --caption_extension=".txt" --enable_bucket --min_bucket_reso=480 --max_bucket_reso=1560 --gradient_accumulation_steps=1 --learning_rate="1e-07" --learning_rate_te="1e-07" --lr_scheduler="constant" --lr_scheduler_args "milestones=[5000]" "gamma=0.2" --max_data_loader_n_workers="1" --resolution="512,768" --min_timestep=1 --max_timestep=1000 --max_token_length=225 --mixed_precision="bf16" --optimizer_type="lion8bit" --output_dir="C:\Train\train model" --output_name="test1" --pretrained_model_name_or_path="C:\Train\Model" --sample_every_n_steps="100" --save_every_n_steps="10000" --save_model_as=diffusers --save_precision="bf16" --shuffle_caption --train_batch_size="1" --train_data_dir="C:\Train\train model\img" --xformers --sample_sampler=euler_a --sample_prompts="C:\Train\train model\sample\prompt.txt" --noise_offset=0.1 --adaptive_noise_scale=0.01 --caption_tag_dropout_rate="0.5" --min_snr_gamma="5" --debiased_estimation_loss --max_train_steps="5000"
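For a command this long, it can help to pull out only the flags that limit training length. The snippet below is a minimal sketch: `length_flags` is a hypothetical helper, not part of kohya_ss or sd-scripts, and the shortened `cmd` string stands in for the full command.

```python
import re

def length_flags(command):
    """Return {flag_name: int} for step/epoch limit flags in a command string."""
    # Matches --flag=123, --flag="123" and --flag 123.
    pattern = r'--(max_train_steps|max_train_epochs)[=\s]+"?(\d+)"?'
    return {name: int(value) for name, value in re.findall(pattern, command)}

# Shortened stand-in for the printed command.
cmd = 'accelerate launch train_db.py --train_batch_size="1" --max_train_steps="5000"'
print(length_flags(cmd))  # {'max_train_steps': 5000}
```

Run against the command above, this would report only the step limit, since no --max_train_epochs flag is passed there.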