
A deep train failure with sd3 position embeds.

Open AbstractEyes opened this issue 8 months ago • 5 comments

[rank1]:     noise_pred, target, timesteps, weighting = self.get_noise_pred_and_target(
[rank1]:   File "/workspace/kohya_ss/sd-scripts/sd3_train_network.py", line 355, in get_noise_pred_and_target
[rank1]:     model_pred = unet(noisy_model_input, timesteps, context=context, y=lg_pooled)
[rank1]:   File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 819, in forward
[rank1]:     return model_forward(*args, **kwargs)
[rank1]:   File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 807, in __call__
[rank1]:     return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank1]:   File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/workspace/kohya_ss/sd-scripts/library/sd3_models.py", line 1124, in forward
[rank1]:     pos_embed = self.cropped_pos_embed(H, W, device=x.device, random_crop=pos_emb_random_crop).to(dtype=x.dtype)
[rank1]:   File "/workspace/kohya_ss/sd-scripts/library/sd3_models.py", line 989, in cropped_pos_embed
[rank1]:     assert w <= self.pos_embed_max_size, (w, self.pos_embed_max_size)
[rank1]: AssertionError: (488, 384)
W0411 18:22:11.053000 15200 venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 15270 closing signal SIGTERM
W0411 18:22:11.063000 15200 venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 15272 closing signal SIGTERM
W0411 18:22:11.073000 15200 venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 15273 closing signal SIGTERM
W0411 18:22:11.081000 15200 venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 15274 closing signal SIGTERM
E0411 18:22:13.920000 15200 venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 15271) of binary: /workspace/kohya_ss/venv/bin/python3
Traceback (most recent call last):
  File "/workspace/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
    multi_gpu_launcher(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/workspace/kohya_ss/sd-scripts/sd3_train_network.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-04-11_18:22:11
  host      : 6396c67933b1
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 15271)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
18:22:15-890513 INFO     Training has ended.

These runs aren't cheap for me. This stalls me for nearly another 4 hours of caching, plus the 5 hours that had already passed before I noticed the crash.

The logs show nothing useful. They just show the trainer running a forward pass through the model and then falling over while computing the positional embeds.

Assertions are fast, but they're a blunt instrument. An explicit check with real handling is the better choice in many situations, and this is one of those expensive cases where handling the condition would be far cheaper than asserting and killing the run. That's a million images, batch size 12, 5 L40S GPUs, and nearly an epoch and a half. So it made it that far before it fell over, which is definitely not a good sign.
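For context, the assertion that fired is the bound check in `cropped_pos_embed`: the computed patch-grid width (488) exceeded `pos_embed_max_size` (384). One way to turn this from a 40+ hour surprise into an immediate failure is a pre-flight scan of the dataset resolutions. The sketch below is not part of sd-scripts; the divisor of 16 (VAE factor 8 times patch size 2), the extension list, and the dataset path are my assumptions, so adjust them to how your setup actually measures the grid.

```python
from pathlib import Path

from PIL import Image

# Shown in the traceback above: this checkpoint's positional-embedding table is 384 per side.
POS_EMBED_MAX_SIZE = 384
# Assumption: patch-grid size = pixel size // 16 (VAE factor 8 * patch size 2).
GRID_DIVISOR = 16


def oversized_images(image_dir: str, max_grid: int = POS_EMBED_MAX_SIZE):
    """Yield (path, width, height) for images whose patch grid would exceed the table."""
    for path in Path(image_dir).rglob("*"):
        if path.suffix.lower() not in {".png", ".jpg", ".jpeg", ".webp"}:
            continue
        with Image.open(path) as img:
            width, height = img.size
        if width // GRID_DIVISOR > max_grid or height // GRID_DIVISOR > max_grid:
            yield str(path), width, height


if __name__ == "__main__":
    bad = list(oversized_images("/path/to/dataset"))  # hypothetical dataset path
    for path, w, h in bad:
        print(f"{path}: {w}x{h} would exceed the positional-embed grid ({POS_EMBED_MAX_SIZE})")
    if bad:
        raise SystemExit(1)  # fail before training starts, not 40+ hours in
```

The same bound could also be handled inside the trainer, by skipping or downscaling the offending batch instead of asserting, which is what I mean by preferring plain logic over an assertion here.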

AbstractEyes avatar Apr 12 '25 01:04 AbstractEyes

Yeah it's just continually crashing now due to some sort of math not lining up.

AbstractEyes avatar Apr 12 '25 03:04 AbstractEyes

Is it still crashing? Because I'm doing fine-tuning of SD3 and it didn't crash for me, so maybe you are doing something wrong.

vikas784 avatar Apr 21 '25 12:04 vikas784

Is it still crashing? Because I'm doing fine-tuning of SD3 and it didn't crash for me, so maybe you are doing something wrong.

Seems likely to be a BF16 conversion error within the methods used to process the position embeds. I switched t5xxl-unchained to fp16 and it stopped imploding. However, at that point it seemed not to free VRAM properly, so I would only get about 1 epoch before it ran out of VRAM and died, which is all I really needed. I was most definitely training the t5xxl.
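I can't prove the BF16 theory from these logs, but the amount of precision a sin/cos position-embedding computation loses in reduced dtypes is easy to measure in isolation. The snippet below is not kohya's code; it's a toy 1D grid (488 echoes the failing grid width above, 1536 is an arbitrary stand-in for the embed dim) that rounds the positions and frequencies to a given dtype and compares the resulting angles against an fp32 reference.

```python
import torch


def pos_embed_angles(grid_size: int, embed_dim: int, dtype: torch.dtype) -> torch.Tensor:
    """Angles of a toy 1D sin/cos positional embedding, with positions and
    frequencies rounded to `dtype` before the multiply."""
    pos = torch.arange(grid_size, dtype=torch.float32).to(dtype)              # positions 0..grid_size-1
    omega = torch.arange(embed_dim // 2, dtype=torch.float32) / (embed_dim // 2)
    freq = (1.0 / 10000 ** omega).to(dtype)                                   # per-channel frequencies
    return (pos[:, None] * freq[None, :]).float()                             # (grid_size, embed_dim // 2)


ref = pos_embed_angles(488, 1536, torch.float32)
bf16 = pos_embed_angles(488, 1536, torch.bfloat16)
fp16 = pos_embed_angles(488, 1536, torch.float16)

print("max angle error, bf16:", (bf16 - ref).abs().max().item())
print("max angle error, fp16:", (fp16 - ref).abs().max().item())
# The sin/cos of these angles is what lands in the embedding table, so an angle
# error approaching a radian makes the high-frequency channels meaningless.
print("max sin drift, bf16:", (bf16.sin() - ref.sin()).abs().max().item())
print("max sin drift, fp16:", (fp16.sin() - ref.sin()).abs().max().item())
```

On my understanding of the formats, bf16's coarser mantissa lets the angles at high positions drift by a radian or more while fp16 stays several times closer, which would at least be consistent with fp16 behaving better here; still a guess, not a diagnosis of the actual crash.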

AbstractEyes avatar Apr 21 '25 15:04 AbstractEyes

Actually, I posted a thread on how to make the dataset and how to fine-tune it. That's what I did, and that approach is working for me; let's see how the results turn out.

vikas784 avatar Apr 21 '25 15:04 vikas784

Can you tell me how you set up the dataset, what preprocessing you did, and how you ran the code? In my case training worked, but the results are not very good.

vikas784 avatar Apr 22 '25 05:04 vikas784