SimpleTuner
Error when using flow_schedule_auto_shift
After 99 steps I get this error message:
Epoch 1/69, Steps: 1%|▏ | 99/10000 [41:26<63:52:24, 23.22s/it, grad_absmax=0.002, lr=0.000358, step_loss=0.356]
Calculate validation loss (clothing-eval): 0%| | 0/1 [00:00<?, ?it/s]Calculate validation loss (clothing-eval): 0%| | 0/1 [00:00<?, ?it/s]
`mu` must be passed when `use_dynamic_shifting` is set to be `True`
Traceback (most recent call last):
File "/root/SimpleTuner/train.py", line 71, in <module>
trainer.train()
File "/root/SimpleTuner/helpers/training/trainer.py", line 2334, in train
all_accumulated_losses = self.evaluation.execute_eval(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/SimpleTuner/helpers/training/validation.py", line 1696, in execute_eval
ds_losses = self._evaluate_dataset_pass(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/SimpleTuner/helpers/training/validation.py", line 1567, in _evaluate_dataset_pass
eval_timestep_list = self.get_timestep_schedule(noise_scheduler)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/SimpleTuner/helpers/training/validation.py", line 1522, in get_timestep_schedule
noise_scheduler.set_timesteps(self.config.eval_timesteps)
File "/usr/local/lib/python3.12/site-packages/diffusers/schedulers/scheduling_flow_match_euler_discrete.py", line 276, in set_timesteps
raise ValueError("`mu` must be passed when `use_dynamic_shifting` is set to be `True`")
ValueError: `mu` must be passed when `use_dynamic_shifting` is set to be `True`
Epoch 1/69, Steps: 1%|▏ | 100/10000 [41:34<68:36:35, 24.95s/it, grad_absmax=0.002, lr=0.000358, step_loss=0.356]
The config file used is:
{
"--resume_from_checkpoint": "latest",
"--data_backend_config": "config/clothing_250501_local.json",
"--aspect_bucket_rounding": 2,
"--seed": 42,
"--minimum_image_size": 0,
"--disable_benchmark": false,
"--output_dir": "/root/SimpleTuner/modal_output",
"--lora_type": "lycoris",
"--lycoris_config": "config/lycoris_config.json",
"--num_train_epochs": 0,
"--max_train_steps": 10000,
"--ignore_final_epochs": true,
"--checkpointing_steps": 100,
"--validation_steps": 100,
"--eval_steps_interval": 100,
"--ema_update_interval": 1,
"--checkpoints_total_limit": 20,
"--attention_mechanism": "diffusers",
"--hub_model_id": "roman_clothing",
"--push_to_hub": "true",
"--push_checkpoints_to_hub": "true",
"--tracker_project_name": "stablellama-stable-llama",
"--tracker_run_name": "clothing_250501-01",
"--report_to": "wandb",
"--model_type": "lora",
"--pretrained_model_name_or_path": "/root/FLUX.1-dev/",
"--model_family": "flux",
"--train_batch_size": 4,
"--gradient_accumulation_steps": "3",
"--gradient_checkpointing": "true",
"--caption_dropout_probability": 0.0,
"--resolution_type": "pixel_area",
"--resolution": 1024,
"--validation_seed": 42,
"--validation_resolution": "1024x1024",
"--validation_guidance": 3.5,
"--validation_guidance_rescale": "0.0",
"--validation_num_inference_steps": "20",
"--num_eval_images": 1,
"--user_prompt_library": "config/user_prompt_library.json",
"--validation_torch_compile": "false",
"--mixed_precision": "bf16",
"--optimizer": "adamw_bf16",
"--learning_rate": "3e-5",
"--lr_end": "1e-6",
"--lr_scale": true,
"--lr_scheduler": "polynomial",
"--lr_warmup_steps": 25,
"--base_model_precision": "int8-quanto",
"--use_ema": true,
"--ema_validation": "comparison",
"--ema_decay": "0.999",
"--flow_schedule_shift": 0,
"--flow_schedule_auto_shift": true,
"--flux_guidance_value": 1.0
}
Running again with the step intervals reduced, training now fails after 25 more steps. The only setting I reduced to 25 is --eval_steps_interval, so I guess the issue must be in the eval.
Changing the config to "--eval_steps_interval": 2500000 (i.e. effectively disabling eval) makes the model train normally, including generation of validation images etc.
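=> This issue is caused by eval: per the traceback, get_timestep_schedule in validation.py calls noise_scheduler.set_timesteps(self.config.eval_timesteps) without `mu`, which the scheduler requires when `use_dynamic_shifting` is `True`.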
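For illustration, here is a rough sketch of how the eval path could supply `mu` when dynamic shifting is on. This is not SimpleTuner's actual code: `set_eval_timesteps` and `flux_image_seq_len` are hypothetical helpers, and the shift defaults (256, 4096, 0.5, 1.15) are assumed to match the usual Flux values, which would normally be read from the scheduler config. The only scheduler call relied on is `set_timesteps(..., mu=...)`, which the traceback shows is expected.

```python
def calculate_shift(image_seq_len, base_seq_len=256, max_seq_len=4096,
                    base_shift=0.5, max_shift=1.15):
    """Linearly interpolate the shift `mu` from the packed latent sequence length.

    The default values are assumptions (typical Flux settings), not read from a config.
    """
    slope = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    intercept = base_shift - slope * base_seq_len
    return image_seq_len * slope + intercept


def flux_image_seq_len(height, width, vae_scale_factor=8, patch_size=2):
    """Hypothetical helper: number of packed latent tokens for a pixel resolution."""
    cell = vae_scale_factor * patch_size
    return (height // cell) * (width // cell)


def set_eval_timesteps(noise_scheduler, num_steps, height, width):
    """Hypothetical helper: pass `mu` only when the scheduler uses dynamic shifting."""
    if getattr(noise_scheduler.config, "use_dynamic_shifting", False):
        mu = calculate_shift(flux_image_seq_len(height, width))
        noise_scheduler.set_timesteps(num_steps, mu=mu)
    else:
        noise_scheduler.set_timesteps(num_steps)
    return noise_scheduler.timesteps
```

With these assumed defaults, a 1024x1024 image maps to 4096 packed tokens, which is the upper end of the interpolation range. Whether eval should use a resolution-dependent `mu` or a fixed one is a design choice for the maintainers.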