
Error when using flow_schedule_auto_shift

Open StableLlama opened this issue 8 months ago • 2 comments

After 99 steps I get this error message:

Epoch 1/69, Steps:   1%|▏             | 99/10000 [41:26<63:52:24, 23.22s/it, grad_absmax=0.002, lr=0.000358, step_loss=0.356]

Calculate validation loss (clothing-eval):   0%|                                                       | 0/1 [00:00<?, ?it/s]
`mu` must be passed when `use_dynamic_shifting` is set to be `True`
Traceback (most recent call last):
  File "/root/SimpleTuner/train.py", line 71, in <module>
    trainer.train()
  File "/root/SimpleTuner/helpers/training/trainer.py", line 2334, in train
    all_accumulated_losses = self.evaluation.execute_eval(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/SimpleTuner/helpers/training/validation.py", line 1696, in execute_eval
    ds_losses = self._evaluate_dataset_pass(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/SimpleTuner/helpers/training/validation.py", line 1567, in _evaluate_dataset_pass
    eval_timestep_list = self.get_timestep_schedule(noise_scheduler)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/SimpleTuner/helpers/training/validation.py", line 1522, in get_timestep_schedule
    noise_scheduler.set_timesteps(self.config.eval_timesteps)
  File "/usr/local/lib/python3.12/site-packages/diffusers/schedulers/scheduling_flow_match_euler_discrete.py", line 276, in set_timesteps
    raise ValueError("`mu` must be passed when `use_dynamic_shifting` is set to be `True`")
ValueError: `mu` must be passed when `use_dynamic_shifting` is set to be `True`

Epoch 1/69, Steps:   1%|▏            | 100/10000 [41:34<68:36:35, 24.95s/it, grad_absmax=0.002, lr=0.000358, step_loss=0.356]
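The traceback shows `set_timesteps` being called with only a step count while the scheduler has `use_dynamic_shifting` enabled. Below is a minimal sketch of one possible workaround, assuming the resolution-dependent shift `mu` is interpolated the way the diffusers Flux pipeline does in its `calculate_shift` helper. The default constants, the `latent_seq_len` helper and the `set_eval_timesteps` wrapper are illustrative assumptions and not SimpleTuner code.

```python
def calculate_shift(image_seq_len, base_seq_len=256, max_seq_len=4096,
                    base_shift=0.5, max_shift=1.15):
    """Linearly interpolate mu between base_shift and max_shift by sequence length.

    The default constants are assumptions; real values normally come from the
    scheduler config.
    """
    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    b = base_shift - m * base_seq_len
    return image_seq_len * m + b


def latent_seq_len(width, height, vae_scale=8, patch=2):
    """Packed latent token count, assuming an 8x VAE downscale and 2x2 patches."""
    return (width // (vae_scale * patch)) * (height // (vae_scale * patch))


def set_eval_timesteps(scheduler, num_steps, width, height):
    """Hypothetical wrapper: pass mu only when the scheduler uses dynamic shifting."""
    if getattr(scheduler.config, "use_dynamic_shifting", False):
        mu = calculate_shift(latent_seq_len(width, height))
        scheduler.set_timesteps(num_steps, mu=mu)
    else:
        scheduler.set_timesteps(num_steps)
    return scheduler.timesteps
```

Under these assumptions a 1024x1024 image maps to 4096 latent tokens, so `mu` lands on `max_shift`. Whether the eval pass should use the same resolution as training is a separate question this sketch does not answer.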

The config file used is:

{
    "--resume_from_checkpoint": "latest",
    "--data_backend_config": "config/clothing_250501_local.json",
    "--aspect_bucket_rounding": 2,
    "--seed": 42,
    "--minimum_image_size": 0,
    "--disable_benchmark": false,
    "--output_dir": "/root/SimpleTuner/modal_output",
    "--lora_type": "lycoris",
    "--lycoris_config": "config/lycoris_config.json",
    "--num_train_epochs": 0,
    "--max_train_steps": 10000,
    "--ignore_final_epochs": true,
    "--checkpointing_steps": 100,
    "--validation_steps": 100,
    "--eval_steps_interval": 100,
    "--ema_update_interval": 1,
    "--checkpoints_total_limit": 20,
    "--attention_mechanism": "diffusers",

    "--hub_model_id": "roman_clothing",
    "--push_to_hub": "true",
    "--push_checkpoints_to_hub": "true",

    "--tracker_project_name": "stablellama-stable-llama",
    "--tracker_run_name": "clothing_250501-01",
    "--report_to": "wandb",
    "--model_type": "lora",
    "--pretrained_model_name_or_path": "/root/FLUX.1-dev/",
    "--model_family": "flux",
    "--train_batch_size": 4,
    "--gradient_accumulation_steps": "3",
    "--gradient_checkpointing": "true",
    "--caption_dropout_probability": 0.0,
    "--resolution_type": "pixel_area",
    "--resolution": 1024,
    "--validation_seed": 42,
    "--validation_resolution": "1024x1024",
    "--validation_guidance": 3.5,
    "--validation_guidance_rescale": "0.0",
    "--validation_num_inference_steps": "20",
    "--num_eval_images": 1,
    "--user_prompt_library": "config/user_prompt_library.json",
    "--validation_torch_compile": "false",
    "--mixed_precision": "bf16",
    "--optimizer": "adamw_bf16",
    "--learning_rate": "3e-5",
    "--lr_end": "1e-6",
    "--lr_scale": true,
    "--lr_scheduler": "polynomial",
    "--lr_warmup_steps": 25,
    "--base_model_precision": "int8-quanto",
    "--use_ema": true,
    "--ema_validation": "comparison",
    "--ema_decay": "0.999",
    "--flow_schedule_shift": 0,
    "--flow_schedule_auto_shift": true,
    "--flux_guidance_value": 1.0
}

StableLlama, May 02 '25 18:05

Running again, this time with reduced values for the various step settings, it fails after 25 more steps. The only setting reduced to 25 is --eval_steps_interval, so I guess it must be an issue with the eval.

StableLlama, May 02 '25 19:05

Changing the config to "--eval_steps_interval": 2500000 (i.e. effectively disabling eval) lets the model train normally, including generation of validation images etc.

=> This issue is caused by eval
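As a rough diagnostic for this combination, the sketch below flags a config in which the loss evaluation would actually run while auto shift is enabled. The function name is hypothetical, and treating an interval larger than --max_train_steps as "eval never runs" is an assumption drawn from the workaround above, not documented SimpleTuner behaviour.

```python
import json


def eval_auto_shift_conflict(config):
    """Return True when eval would run at least once with auto shift enabled."""
    def truthy(value):
        # Config values appear both as JSON booleans and as strings like "true".
        return str(value).strip().lower() in ("true", "1", "yes")

    auto_shift = truthy(config.get("--flow_schedule_auto_shift", False))
    interval = config.get("--eval_steps_interval")
    max_steps = int(config.get("--max_train_steps", 0))
    # Assumption: eval only triggers if the interval fits inside the run.
    eval_runs = interval is not None and 0 < int(interval) <= max_steps
    return auto_shift and eval_runs


failing = json.loads(
    '{"--max_train_steps": 10000, "--eval_steps_interval": 100,'
    ' "--flow_schedule_auto_shift": true}'
)
workaround = dict(failing, **{"--eval_steps_interval": 2500000})

print(eval_auto_shift_conflict(failing))
print(eval_auto_shift_conflict(workaround))
```

The first config mirrors the settings from the report and the second mirrors the workaround with the very large interval.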

StableLlama, May 02 '25 21:05