lora-scripts

Multi-GPU training fails with an error; single-GPU training works

Open lidisi8520 opened this issue 11 months ago • 0 comments

11:29:09-280078 INFO     Found 1 legal dataset
11:29:25-631481 INFO     Wrote promopts to file
                         D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250113-112909-promopt.txt
11:29:25-639480 INFO     Training started with config file / 训练开始,使用配置文件:
                         D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250113-112909.toml
11:29:25-648481 INFO     Using GPU(s) / 使用 GPU: ['0', '1', '2']
11:29:25-652481 INFO     Task 5c92af4b-71c1-48d1-a356-ecf9270c1918 created
W0113 11:29:28.772000 18304 torch\distributed\elastic\multiprocessing\redirects.py:28] NOTE: Redirects are currently not supported in Windows or MacOs.
W0113 11:29:30.883000 18304 torch\distributed\run.py:771] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
2025-01-13 11:29:41 INFO     Loading settings from                                                    train_util.py:3745
                             D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250
                             113-112909.toml...
2025-01-13 11:29:41 INFO     Loading settings from                                                    train_util.py:3745
                             D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250
                             113-112909.toml...
2025-01-13 11:29:41 INFO     Loading settings from                                                    train_util.py:3745
                             D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250
                             113-112909.toml...
                    INFO     D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250 train_util.py:3764
                             113-112909
                    INFO     D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250 train_util.py:3764
                             113-112909
                    INFO     D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20250 train_util.py:3764
                             113-112909
2025-01-13 11:29:41 INFO     prepare tokenizer                                                        train_util.py:4228
2025-01-13 11:29:41 INFO     prepare tokenizer                                                        train_util.py:4228
2025-01-13 11:29:41 INFO     prepare tokenizer                                                        train_util.py:4228
2025-01-13 11:29:42 INFO     update token length: 255                                                 train_util.py:4245
                    INFO     Using DreamBooth method.                                               train_network.py:172
2025-01-13 11:29:42 INFO     update token length: 255                                                 train_util.py:4245
                    INFO     Using DreamBooth method.                                               train_network.py:172
2025-01-13 11:29:42 INFO     update token length: 255                                                 train_util.py:4245
                    INFO     Using DreamBooth method.                                               train_network.py:172
                    INFO     prepare images.                                                          train_util.py:1573
                    INFO     found directory                                                          train_util.py:1520
                             D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\people\10_peopl
                             e contains 10 image files
                    INFO     prepare images.                                                          train_util.py:1573
                    INFO     100 train images with repeating.                                         train_util.py:1614
                    INFO     0 reg images.                                                            train_util.py:1617
                    INFO     found directory                                                          train_util.py:1520
                             D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\people\10_peopl
                             e contains 10 image files
                    WARNING  no regularization images / 正則化画像が見つかりませんでした              train_util.py:1622
                    INFO     100 train images with repeating.                                         train_util.py:1614
                    INFO     0 reg images.                                                            train_util.py:1617
                    INFO     prepare images.                                                          train_util.py:1573
                    WARNING  no regularization images / 正則化画像が見つかりませんでした              train_util.py:1622
                    INFO     [Dataset 0]                                                              config_util.py:565
                               batch_size: 1
                               resolution: (512, 768)
                               enable_bucket: True
                               network_multiplier: 1.0
                               min_bucket_reso: 256
                               max_bucket_reso: 1024
                               bucket_reso_steps: 64
                               bucket_no_upscale: True

                               [Subset 0 of Dataset 0]
                                 image_dir:
                             "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\people\10_peop
                             le"
                                 image_count: 10
                                 num_repeats: 10
                                 shuffle_caption: True
                                 keep_tokens: 0
                                 keep_tokens_separator: ,
                                 secondary_separator: None
                                 enable_wildcard: False
                                 caption_dropout_rate: 0.0
                                 caption_dropout_every_n_epoches: 0
                                 caption_tag_dropout_rate: 0.0
                                 caption_prefix: None
                                 caption_suffix: None
                                 color_aug: False
                                 flip_aug: False
                                 face_crop_aug_range: None
                                 random_crop: False
                                 token_warmup_min: 1,
                                 token_warmup_step: 0,
                                 is_reg: False
                                 class_tokens: people
                                 caption_extension: .txt


                    INFO     [Dataset 0]                                                              config_util.py:571
                    INFO     found directory                                                          train_util.py:1520
                             D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\people\10_peopl
                             e contains 10 image files
                    INFO     loading image sizes.                                                      train_util.py:854
                    INFO     [Dataset 0]                                                              config_util.py:565
                               batch_size: 1
                               resolution: (512, 768)
                               enable_bucket: True
                               network_multiplier: 1.0
                               min_bucket_reso: 256
                               max_bucket_reso: 1024
                               bucket_reso_steps: 64
                               bucket_no_upscale: True

                               [Subset 0 of Dataset 0]
                                 image_dir:
                             "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\people\10_peop
                             le"
                                 image_count: 10
                                 num_repeats: 10
                                 shuffle_caption: True
                                 keep_tokens: 0
                                 keep_tokens_separator: ,
                                 secondary_separator: None
                                 enable_wildcard: False
                                 caption_dropout_rate: 0.0
                                 caption_dropout_every_n_epoches: 0
                                 caption_tag_dropout_rate: 0.0
                                 caption_prefix: None
                                 caption_suffix: None
                                 color_aug: False
                                 flip_aug: False
                                 face_crop_aug_range: None
                                 random_crop: False
                                 token_warmup_min: 1,
                                 token_warmup_step: 0,
                                 is_reg: False
                                 class_tokens: people
                                 caption_extension: .txt


                    INFO     100 train images with repeating.                                         train_util.py:1614
                    INFO     [Dataset 0]                                                              config_util.py:571
                    INFO     0 reg images.                                                            train_util.py:1617
  0%|                                                                                           | 0/10 [00:00<?, ?it/s]                    INFO     loading image sizes.                                                      train_util.py:854
100%|████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 9984.06it/s]
                    WARNING  no regularization images / 正則化画像が見つかりませんでした              train_util.py:1622
                    INFO     make buckets                                                              train_util.py:860
100%|████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 9957.99it/s]
                    WARNING  min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is   train_util.py:877
                             set, because bucket reso is defined by image size automatically /
                             bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計
                             算されるため、min_bucket_resoとmax_bucket_resoは無視されます
                    INFO     make buckets                                                              train_util.py:860
                    INFO     [Dataset 0]                                                              config_util.py:565
                               batch_size: 1
                               resolution: (512, 768)
                               enable_bucket: True
                               network_multiplier: 1.0
                               min_bucket_reso: 256
                               max_bucket_reso: 1024
                               bucket_reso_steps: 64
                               bucket_no_upscale: True

                               [Subset 0 of Dataset 0]
                                 image_dir:
                             "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\people\10_peop
                             le"
                                 image_count: 10
                                 num_repeats: 10
                                 shuffle_caption: True
                                 keep_tokens: 0
                                 keep_tokens_separator: ,
                                 secondary_separator: None
                                 enable_wildcard: False
                                 caption_dropout_rate: 0.0
                                 caption_dropout_every_n_epoches: 0
                                 caption_tag_dropout_rate: 0.0
                                 caption_prefix: None
                                 caption_suffix: None
                                 color_aug: False
                                 flip_aug: False
                                 face_crop_aug_range: None
                                 random_crop: False
                                 token_warmup_min: 1,
                                 token_warmup_step: 0,
                                 is_reg: False
                                 class_tokens: people
                                 caption_extension: .txt


                    INFO     number of images (including repeats) /                                    train_util.py:906
                             各bucketの画像枚数(繰り返し回数を含む)
                    INFO     [Dataset 0]                                                              config_util.py:571
                    WARNING  min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is   train_util.py:877
                             set, because bucket reso is defined by image size automatically /
                             bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計
                             算されるため、min_bucket_resoとmax_bucket_resoは無視されます
                    INFO     bucket 0: resolution (384, 1024), count: 10                               train_util.py:911
                    INFO     loading image sizes.                                                      train_util.py:854
                    INFO     bucket 1: resolution (448, 768), count: 10                                train_util.py:911
                    INFO     number of images (including repeats) /                                    train_util.py:906
                             各bucketの画像枚数(繰り返し回数を含む)
                    INFO     bucket 2: resolution (448, 832), count: 40                                train_util.py:911
                    INFO     bucket 0: resolution (384, 1024), count: 10                               train_util.py:911
100%|████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 9974.56it/s]
                    INFO     bucket 3: resolution (512, 704), count: 30                                train_util.py:911
                    INFO     bucket 1: resolution (448, 768), count: 10                                train_util.py:911
                    INFO     make buckets                                                              train_util.py:860
                    INFO     bucket 4: resolution (576, 576), count: 10                                train_util.py:911
                    INFO     bucket 2: resolution (448, 832), count: 40                                train_util.py:911
                    WARNING  min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is   train_util.py:877
                             set, because bucket reso is defined by image size automatically /
                             bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計
                             算されるため、min_bucket_resoとmax_bucket_resoは無視されます
                    INFO     mean ar error (without repeats): 0.01810373870920746                      train_util.py:916
                    INFO     bucket 3: resolution (512, 704), count: 30                                train_util.py:911
                    INFO     bucket 4: resolution (576, 576), count: 10                                train_util.py:911
                    INFO     number of images (including repeats) /                                    train_util.py:906
                             各bucketの画像枚数(繰り返し回数を含む)
                    INFO     preparing accelerator                                                  train_network.py:225
                    INFO     mean ar error (without repeats): 0.01810373870920746                      train_util.py:916
                    INFO     bucket 0: resolution (384, 1024), count: 10                               train_util.py:911
                    INFO     bucket 1: resolution (448, 768), count: 10                                train_util.py:911
                    INFO     bucket 2: resolution (448, 832), count: 40                                train_util.py:911
                    INFO     preparing accelerator                                                  train_network.py:225
                    INFO     bucket 3: resolution (512, 704), count: 30                                train_util.py:911
                    INFO     bucket 4: resolution (576, 576), count: 10                                train_util.py:911
                    INFO     mean ar error (without repeats): 0.01810373870920746                      train_util.py:916
[W113 11:29:42.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to [stable-diffusio.internal.chinacloudapp.cn]:62018 (system error: 10049 - ??????,?????????).
[W113 11:29:42.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to [stable-diffusio.internal.chinacloudapp.cn]:62018 (system error: 10049 - ??????,?????????).
                    INFO     preparing accelerator                                                  train_network.py:225
[W113 11:29:42.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to [stable-diffusio.internal.chinacloudapp.cn]:62018 (system error: 10049 - ??????,?????????).
[W113 11:30:03.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to stable-diffusio.internal.chinacloudapp.cn:62018 (system error: 10060 - ???????????????????????????,???????).
[W113 11:30:03.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to stable-diffusio.internal.chinacloudapp.cn:62018 (system error: 10060 - ???????????????????????????,???????).
[W113 11:30:03.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to stable-diffusio.internal.chinacloudapp.cn:62018 (system error: 10060 - ???????????????????????????,???????).
[E113 11:30:26.000000000 socket.cpp:753] [c10d] The client socket has failed to connect to any network address of (stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn, 62018).
[E113 11:30:26.000000000 socket.cpp:753] [c10d] The client socket has failed to connect to any network address of (stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn, 62018).
[E113 11:30:26.000000000 socket.cpp:753] [c10d] The client socket has failed to connect to any network address of (stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn, 62018).
Traceback (most recent call last):
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\train_network.py", line 1115, in <module>
    trainer.train(args)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\train_network.py", line 226, in train
    accelerator = train_util.prepare_accelerator(args)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\library\train_util.py", line 4307, in prepare_accelerator
    accelerator = Accelerator(
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\accelerator.py", line 383, in __init__
    self.state = AcceleratorState(
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\state.py", line 846, in __init__
    PartialState(cpu, **kwargs)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\state.py", line 211, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\c10d_logger.py", line 93, in wrapper
    func_return = func(*args, **kwargs)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 1361, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\rendezvous.py", line 258, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\rendezvous.py", line 185, in _create_c10d_store
    return TCPStore(
torch.distributed.DistNetworkError: The client socket has failed to connect to any network address of (stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn, 62018). The client socket has failed to connect to stable-diffusio.internal.chinacloudapp.cn:62018 (system error: 10060 - ???????????????????????????,???????).
E0113 11:30:28.357000 18304 torch\distributed\elastic\multiprocessing\api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 15808) of binary: D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\python.exe
Traceback (most recent call last):
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\commands\launch.py", line 1116, in <module>
    main()
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\commands\launch.py", line 1112, in main
    launch_command(args)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\commands\launch.py", line 1097, in launch_command
    multi_gpu_launcher(args)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\commands\launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\run.py", line 892, in run
    elastic_launch(
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\launcher\api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./scripts/stable/train_network.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-01-13_11:30:28
  host      : stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 18976)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2025-01-13_11:30:28
  host      : stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 8328)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-13_11:30:28
  host      : stable-diffusio.ovbu0rvgww0ufjqj4ztrxqhyab.zqzx.internal.chinacloudapp.cn
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 15808)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
11:30:28-815865 ERROR    Training failed / 训练失败

This is my error output. Training runs normally when I use a single GPU, but it fails when I use multiple GPUs. My training parameters are below:

model_train_type = "sd-lora"
pretrained_model_name_or_path = "D:/webui/sd-webui-aki-v4.6.1/models/Stable-diffusion/v1-5-pruned.safetensors"
resume = ""
v2 = false
train_data_dir = "D:/webui/lora-scripts-v1.10.0/lora-scripts-v1.10.0/train/people"
prior_loss_weight = 1
resolution = "512,768"
enable_bucket = true
min_bucket_reso = 256
max_bucket_reso = 1024
bucket_reso_steps = 64
bucket_no_upscale = true
output_name = "aki_1"
output_dir = "./output"
save_model_as = "safetensors"
save_precision = "fp16"
save_every_n_epochs = 2
save_state = false
max_train_epochs = 10
train_batch_size = 1
gradient_checkpointing = false
gradient_accumulation_steps = 1
network_train_unet_only = false
network_train_text_encoder_only = false
learning_rate = 0.0001
unet_lr = 0.0001
text_encoder_lr = 0.00001
lr_scheduler = "constant"
lr_warmup_steps = 0
optimizer_type = "AdamW8bit"
network_module = "networks.lora"
network_dim = 64
network_alpha = 32
log_with = "tensorboard"
log_prefix = ""
log_tracker_name = ""
logging_dir = "./logs"
caption_extension = ".txt"
shuffle_caption = false
weighted_captions = false
keep_tokens = 0
keep_tokens_separator = ","
max_token_length = 255
random_crop = false
seed = 1337
clip_skip = 2
mixed_precision = "fp16"
xformers = true
lowram = false
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = false
cache_text_encoder_outputs_to_disk = false
persistent_data_loader_workers = true
ddp_gradient_as_bucket_view = false
gpu_ids = [ "0", "1", "2", "3" ]
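
In the log above, every worker fails inside `TCPStore` while trying to reach the machine's own hostname (`stable-diffusio....internal.chinacloudapp.cn`) on port 62018, first with system error 10049 and then with 10060. A minimal diagnostic sketch, assuming the problem is how that hostname resolves on this machine, is shown below. It is not a fix and not part of lora-scripts; `resolve_addresses` and `can_connect` are hypothetical helper names that use only the Python standard library.

```python
import socket


def resolve_addresses(host: str, port: int) -> list[str]:
    """Return the sorted IP addresses `host` resolves to, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Print how this machine's short and fully qualified names resolve.
    for name in (socket.gethostname(), socket.getfqdn()):
        print(name, "->", resolve_addresses(name, 0))
```

If the printed addresses are not ones that belong to a local network adapter, that would be consistent with the 10049 and 10060 errors in the log. Note that port 62018 is only open while the launcher is running, so `can_connect` against it is meaningful only during a training attempt.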

lidisi8520, Jan 13 '25 06:01