
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

Open lidisi8520 opened this issue 1 year ago • 5 comments

This is my first time using this tool. I ran the beginner training mode: I selected the image path and the base model, left all other parameters at their defaults, and started training. It failed with: ValueError: Default process group has not been initialized, please make sure to call init_process_group. How can I fix this?

lidisi8520 avatar Dec 25 '24 01:12 lidisi8520

  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\train_network.py", line 1115, in <module>
    trainer.train(args)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\train_network.py", line 226, in train
    accelerator = train_util.prepare_accelerator(args)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\library\train_util.py", line 4307, in prepare_accelerator
    accelerator = Accelerator(
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\accelerator.py", line 383, in __init__
    self.state = AcceleratorState(
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\state.py", line 846, in __init__
    PartialState(cpu, **kwargs)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\state.py", line 270, in __init__
    self.num_processes = torch.distributed.get_world_size()
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 1832, in get_world_size
    return _get_group_size(group)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 864, in _get_group_size
    default_pg = _get_default_group()
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 1025, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
Here is the full error output.

lidisi8520 avatar Dec 25 '24 01:12 lidisi8520
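For context on where this error comes from: Accelerate only calls torch.distributed.get_world_size() when it believes the script was launched in distributed mode (for example, when the launch configuration requests more than one process), but here the torch.distributed process group was never initialized, so the call raises. A minimal diagnostic sketch, not part of lora-scripts, that shows the condition behind the ValueError:

```python
import torch.distributed as dist

# Sketch only: get_world_size() raises the ValueError above whenever it is
# called before init_process_group(). Guarding the call avoids it in a
# single-process run.
if dist.is_available() and dist.is_initialized():
    world_size = dist.get_world_size()
else:
    world_size = 1  # no process group -> treat as a single-process run
print(f"world_size = {world_size}")
```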

After adding the code you suggested and starting training again, I still get an error. Here is the error output:

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\anyio\streams\memory.py", line 94, in receive
    return self.receive_nowait()
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\anyio\streams\memory.py", line 89, in receive_nowait
    raise WouldBlock
anyio.WouldBlock

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\middleware\base.py", line 78, in call_next
    message = await recv_stream.receive()
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\anyio\streams\memory.py", line 114, in receive
    raise EndOfStream
anyio.EndOfStream

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\uvicorn\protocols\http\h11_impl.py", line 428, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\uvicorn\middleware\proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\fastapi\applications.py", line 276, in __call__
    await super().__call__(scope, receive, send)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\middleware\errors.py", line 184, in __call__
    raise exc
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\middleware\errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\middleware\base.py", line 108, in __call__
    response = await self.dispatch_func(request, call_next)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\mikazuki\app\application.py", line 74, in add_cache_control_header
    response = await call_next(request)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\middleware\base.py", line 84, in call_next
    raise app_exc
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\middleware\base.py", line 70, in coro
    await self.app(scope, receive_or_disconnect, send_no_error)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\middleware\exceptions.py", line 79, in __call__
    raise exc
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\middleware\exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\fastapi\middleware\asyncexitstack.py", line 21, in __call__
    raise e
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\fastapi\middleware\asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\starlette\routing.py", line 66, in app
    response = await func(request)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\fastapi\routing.py", line 237, in app
    raw_response = await run_endpoint_function(
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\fastapi\routing.py", line 163, in run_endpoint_function
    return await dependant.call(**values)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\mikazuki\app\api.py", line 122, in create_toml_file
    dist.init_process_group(backend='gloo')
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\c10d_logger.py", line 79, in wrapper
    return func(*args, **kwargs)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\c10d_logger.py", line 93, in wrapper
    func_return = func(*args, **kwargs)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 1361, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\rendezvous.py", line 246, in _env_rendezvous_handler
    rank = int(_get_env_or_raise("RANK"))
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\rendezvous.py", line 231, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
2024-12-25 10:19:56 INFO     Loading settings from                                                    train_util.py:3745
                             D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20241
                             225-101943.toml...
2024-12-25 10:20:14 INFO     D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\config\autosave\20241 train_util.py:3764
                             225-101943
2024-12-25 10:20:14 INFO     prepare tokenizer                                                        train_util.py:4228
2024-12-25 10:20:15 INFO     update token length: 255                                                 train_util.py:4245
                    INFO     Using DreamBooth method.                                               train_network.py:172
                    INFO     prepare images.                                                          train_util.py:1573
                    INFO     found directory                                                          train_util.py:1520
                             D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\10_people\1_zkz
                             contains 169 image files
                    INFO     169 train images with repeating.                                         train_util.py:1614
                    INFO     0 reg images.                                                            train_util.py:1617
                    WARNING  no regularization images / 正則化画像が見つかりませんでした              train_util.py:1622
                    INFO     [Dataset 0]                                                              config_util.py:565
                               batch_size: 1
                               resolution: (512, 512)
                               enable_bucket: True
                               network_multiplier: 1.0
                               min_bucket_reso: 256
                               max_bucket_reso: 1024
                               bucket_reso_steps: 64
                               bucket_no_upscale: False

                               [Subset 0 of Dataset 0]
                                 image_dir:
                             "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\train\10_people\1_zk
                             z"
                                 image_count: 169
                                 num_repeats: 1
                                 shuffle_caption: True
                                 keep_tokens: 0
                                 keep_tokens_separator:
                                 secondary_separator: None
                                 enable_wildcard: False
                                 caption_dropout_rate: 0.0
                                 caption_dropout_every_n_epoches: 0
                                 caption_tag_dropout_rate: 0.0
                                 caption_prefix: None
                                 caption_suffix: None
                                 color_aug: False
                                 flip_aug: False
                                 face_crop_aug_range: None
                                 random_crop: False
                                 token_warmup_min: 1,
                                 token_warmup_step: 0,
                                 is_reg: False
                                 class_tokens: zkz
                                 caption_extension: .txt


                    INFO     [Dataset 0]                                                              config_util.py:571
                    INFO     loading image sizes.                                                      train_util.py:854
100%|█████████████████████████████████████████████████████████████████████████████| 169/169 [00:00<00:00, 11267.30it/s]
                    INFO     make buckets                                                              train_util.py:860
                    INFO     number of images (including repeats) /                                    train_util.py:906
                             各bucketの画像枚数(繰り返し回数を含む)
                    INFO     bucket 0: resolution (320, 704), count: 8                                 train_util.py:911
                    INFO     bucket 1: resolution (320, 768), count: 1                                 train_util.py:911
                    INFO     bucket 2: resolution (384, 640), count: 71                                train_util.py:911
                    INFO     bucket 3: resolution (448, 576), count: 68                                train_util.py:911
                    INFO     bucket 4: resolution (512, 512), count: 17                                train_util.py:911
                    INFO     bucket 5: resolution (576, 448), count: 2                                 train_util.py:911
                    INFO     bucket 6: resolution (640, 384), count: 2                                 train_util.py:911
                    INFO     mean ar error (without repeats): 0.043133719759432164                     train_util.py:916
                    INFO     preparing accelerator                                                  train_network.py:225
Traceback (most recent call last):
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\train_network.py", line 1115, in <module>
    trainer.train(args)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\train_network.py", line 226, in train
    accelerator = train_util.prepare_accelerator(args)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\scripts\stable\library\train_util.py", line 4307, in prepare_accelerator
    accelerator = Accelerator(
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\accelerator.py", line 383, in __init__
    self.state = AcceleratorState(
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\state.py", line 846, in __init__
    PartialState(cpu, **kwargs)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\accelerate\state.py", line 270, in __init__
    self.num_processes = torch.distributed.get_world_size()
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 1832, in get_world_size
    return _get_group_size(group)
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 864, in _get_group_size
    default_pg = _get_default_group()
  File "D:\webui\lora-scripts-v1.10.0\lora-scripts-v1.10.0\python\lib\site-packages\torch\distributed\distributed_c10d.py", line 1025, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

How should I resolve this?

lidisi8520 avatar Dec 25 '24 02:12 lidisi8520
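The new traceback points at the workaround itself: the dist.init_process_group(backend='gloo') call in mikazuki/app/api.py passes no rank or world_size, so PyTorch falls back to the env:// rendezvous, which requires the RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT environment variables. A hedged sketch of what that call needs for a single-process run (the values are placeholders, not lora-scripts settings):

```python
import os
import torch.distributed as dist

# Sketch only: init_process_group() without rank/world_size uses the env://
# rendezvous, which reads the four variables below. Placeholder values for a
# single-process run; not the project's configuration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")         # index of this process
os.environ.setdefault("WORLD_SIZE", "1")   # total number of processes

if not dist.is_initialized():
    dist.init_process_group(backend="gloo")  # gloo also works on Windows CPU
```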

I think I've found the answer: the problem was that I hadn't selected a GPU for training. After selecting one, training runs normally. However, I have four GPUs, and if I select all of them the error comes back. In other words, I can't select all 4 GPUs at once, and I don't know why. I can only train on a single GPU.

lidisi8520 avatar Dec 25 '24 03:12 lidisi8520

I think I've found the answer: the problem was that I hadn't selected a GPU for training. After selecting one, training runs normally. However, I have four GPUs, and if I select all of them the error comes back. In other words, I can't select all 4 GPUs at once, and I don't know why. I can only train on a single GPU.

I'm running into the same problem. Where do you select the GPU?

xs315431 avatar Feb 20 '25 08:02 xs315431

I think I've found the answer: the problem was that I hadn't selected a GPU for training. After selecting one, training runs normally. However, I have four GPUs, and if I select all of them the error comes back. In other words, I can't select all 4 GPUs at once, and I don't know why. I can only train on a single GPU.

I'm running into the same problem. Where do you select the GPU?

At the bottom of the expert training page there is a dropdown for selecting the GPU (it only appears if you have multiple GPUs). I later set up a server with a single GPU, and the dropdown did not appear there.

lidisi8520 avatar Feb 27 '25 03:02 lidisi8520