Although the folder preparation button has been restored, the image folder still does not work
See below; I do not know what is wrong:
Traceback (most recent call last):
  File "/kaggle/working/kohya_ss/sd-scripts/train_network.py", line 1115, in <module>
    trainer.train(args)
  File "/kaggle/working/kohya_ss/sd-scripts/train_network.py", line 234, in train
    model_version, text_encoder, vae, unet = self.load_target_model(args, weight_dtype, accelerator)
  File "/kaggle/working/kohya_ss/sd-scripts/train_network.py", line 101, in load_target_model
    text_encoder, vae, unet, _ = train_util.load_target_model(args, weight_dtype, accelerator)
  File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 4387, in load_target_model
    text_encoder, vae, unet, load_stable_diffusion_format = _load_target_model(
  File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 4363, in _load_target_model
    original_unet = UNet2DConditionModel(
  File "/kaggle/working/kohya_ss/sd-scripts/library/original_unet.py", line 1427, in __init__
    attn_num_head_channels=attention_head_dim[i],
IndexError: list index out of range
[2024-04-09 05:08:36,532] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1114) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/kaggle/working/kohya_ss/sd-scripts/train_network.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2024-04-09_05:08:36
  host      : c9fb5313e204
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1114)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
This seems to be the same error as https://github.com/bmaltais/kohya_ss/issues/2244.
Not sure what causes this error… it comes from the training script and I don't think I can do anything about it. You might want to open an issue directly on the sd-scripts repo.
okay
File "/kaggle/working/kohya_ss/sd-scripts/sdxl_train_network.py", line 185, in
trainer.train(args)
File "/kaggle/working/kohya_ss/sd-scripts/train_network.py", line 272, in train
train_dataset_group.cache_latents(vae, args.vae_batch_size, args.cache_latents_to_disk, accelerator.is_main_process)
File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 2080, in cache_latents
dataset.cache_latents(vae, vae_batch_size, cache_to_disk, is_main_process)
File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 1023, in cache_latents
cache_batch_latents(vae, cache_to_disk, batch, subset.flip_aug, subset.random_crop)
File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 2428, in cache_batch_latents
raise RuntimeError(f"NaN detected in latents: {info.absolute_path}")
RuntimeError: NaN detected in latents: /kaggle/working/results/img/25_ohwx tanglaoya/1 (1)_resized.png
[2024-04-11 07:13:14,425] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1115) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/kaggle/working/kohya_ss/sd-scripts/sdxl_train_network.py FAILED

I checked several threads here and found that other people have run into the same problem, so I am really puzzled.
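The RuntimeError is raised by the NaN check inside cache_batch_latents. Below is a minimal, stand-alone sketch of how one could rerun that check on just the flagged image, to see whether the file itself produces NaNs or whether half-precision VAE encoding does; the VAE repo id and the preprocessing are assumptions for illustration, not the settings used in this issue. If fp32 encoding is clean but fp16 is not, sd-scripts' --no_half_vae option is worth trying.

# Hypothetical stand-alone check, mirroring the NaN test cache_batch_latents applies to each image.
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

image_path = "/kaggle/working/results/img/25_ohwx tanglaoya/1 (1)_resized.png"

# Assumption: any SDXL-compatible VAE works for this test; the issue does not say which model was used.
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix")
vae.eval()

# Load the flagged image and scale pixels to [-1, 1], roughly as the trainer does before encoding.
img = Image.open(image_path).convert("RGB")
x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0  # HWC in [-1, 1]
x = x.permute(2, 0, 1).unsqueeze(0)                        # 1 x 3 x H x W

with torch.no_grad():
    latents = vae.encode(x).latent_dist.sample()  # roughly the call the caching step makes

if torch.isnan(latents).any():
    print("NaNs already appear with fp32 encoding -> the image file itself is suspect")
else:
    print("fp32 latents are clean -> retry with a fp16 VAE, or train with --no_half_vae")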
Same error here; I launched train_network.py directly, without the GUI.
It was a stupid mistake: I forgot to change the script name to the sdxl_ version (i.e. to launch sdxl_train_network.py instead of train_network.py).
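For anyone else who hits the first IndexError: it appears while train_network.py is building the SD1.x/2.x UNet in original_unet.py, so pointing that script at an SDXL checkpoint can end up indexing a per-block config list past its length, which matches the fix above (use sdxl_train_network.py for SDXL models). As a purely illustrative pre-flight check, not something from kohya_ss, a .safetensors checkpoint in the original single-file layout can usually be recognized as SDXL by the keys of its second text encoder:

# Hypothetical helper (not part of kohya_ss): SDXL single-file checkpoints carry a second
# text encoder under "conditioner.embedders.1.*", which SD1.x/2.x checkpoints do not.
from safetensors import safe_open

def looks_like_sdxl(checkpoint_path: str) -> bool:
    with safe_open(checkpoint_path, framework="pt", device="cpu") as f:
        return any(key.startswith("conditioner.embedders.1.") for key in f.keys())

# "/path/to/model.safetensors" is a placeholder, not a path from this issue.
if looks_like_sdxl("/path/to/model.safetensors"):
    print("SDXL checkpoint -> launch sdxl_train_network.py")
else:
    print("SD1.x/2.x checkpoint -> launch train_network.py")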