kohya_ss
Multi-GPU not working on both Windows and Linux
Issue: Multi-GPU training has not worked ever since "accelerate launch" was added to the GUI.
Machines I have tried:
- Windows 10, 2x RTX 3090
- Ubuntu 22.04 Server, running in Docker, 6x RTX 3090
- Ubuntu 22.04 Server, running in Docker, 4x A100
Accelerate config: It does not seem to matter; the error message is the same whether or not I configure accelerate for distributed training. The config I used to use for multi-GPU runs was: distributed training: yes; dynamo, DeepSpeed, etc.: no; the correct number of GPUs; and "all" so that every GPU is used. The error messages below were produced under this config.
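For reference, below is a sketch (not my exact file) of the kind of accelerate configuration I mean; accelerate stores it in ~/.cache/huggingface/accelerate/default_config.yaml and the exact fields vary by version, so treat the values as illustrative:

    # Print the configuration accelerate will actually use:
    accelerate env
    # The answers I gave to `accelerate config` were essentially:
    #   compute_environment: LOCAL_MACHINE
    #   distributed_type: MULTI_GPU
    #   num_machines: 1
    #   num_processes: 4        # the number of GPUs in the machine
    #   gpu_ids: all
    #   mixed_precision: 'no'
    #   dynamo / DeepSpeed / FSDP: all disabled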
Version: I am currently on v24.0.3, but this issue has been present ever since accelerate launch was added.
Error message: the following is from the 4x A100 machine, running in Docker.
14:45:42-103931 INFO Start training LoRA Standard ...
14:45:42-105577 INFO Validating lr scheduler arguments...
14:45:42-106378 INFO Validating optimizer arguments...
14:45:42-107054 INFO Validating /dataset/lora/ruanmei/log/ existence and writability... SUCCESS
14:45:42-107750 INFO Validating /dataset/lora/ruanmei/model/ existence and writability... SUCCESS
14:45:42-108480 INFO Validating /dataset/base_model/animagine-xl-3.0-base.safetensors existence... SUCCESS
14:45:42-109179 INFO Validating /dataset/lora/ruanmei/img/ existence... SUCCESS
14:45:42-109829 INFO Validating /dataset/vae/sdxl_vae.safetensors existence... SUCCESS
14:45:42-110513 INFO Headless mode, skipping verification if model already exist... if model already exist it will be overwritten...
14:45:42-111409 INFO Folder 3_ruan_mei_(honkai_star_rail) 1girl: 3 repeats found
14:45:42-112648 INFO Folder 3_ruan_mei_(honkai_star_rail) 1girl: 230 images found
14:45:42-113435 INFO Folder 3_ruan_mei_(honkai_star_rail) 1girl: 230 * 3 = 690 steps
14:45:42-114177 INFO Folder 2_ruan_mei_(honkai_star_rail) 1girl: 2 repeats found
14:45:42-115183 INFO Folder 2_ruan_mei_(honkai_star_rail) 1girl: 247 images found
14:45:42-115878 INFO Folder 2_ruan_mei_(honkai_star_rail) 1girl: 247 * 2 = 494 steps
14:45:42-116578 INFO Folder 5_ruan_mei_(honkai_star_rail) 1girl: 5 repeats found
14:45:42-117377 INFO Folder 5_ruan_mei_(honkai_star_rail) 1girl: 85 images found
14:45:42-118077 INFO Folder 5_ruan_mei_(honkai_star_rail) 1girl: 85 * 5 = 425 steps
14:45:42-118774 INFO Folder 6_ruan_mei_(honkai_star_rail) 1girl: 6 repeats found
14:45:42-119891 INFO Folder 6_ruan_mei_(honkai_star_rail) 1girl: 89 images found
14:45:42-120726 INFO Folder 6_ruan_mei_(honkai_star_rail) 1girl: 89 * 6 = 534 steps
14:45:42-121576 INFO Regulatization factor: 1
14:45:42-122336 INFO Total steps: 2143
14:45:42-123044 INFO Train batch size: 2
14:45:42-123756 INFO Gradient accumulation steps: 1
14:45:42-124479 INFO Epoch: 20
14:45:42-125167 INFO max_train_steps (2143 / 2 / 1 * 20 * 1) = 21430
14:45:42-126096 INFO stop_text_encoder_training = 0
14:45:42-126801 INFO lr_warmup_steps = 2143
14:45:42-128322 INFO Saving training config to /dataset/lora/ruanmei/model/Char-HonkaiSR-Ruanmei-XL-V1_20240510-144542.json...
14:45:42-129527 INFO Executing command: /home/1000/.local/bin/accelerate launch --dynamo_backend no --dynamo_mode default --gpu_ids 0,1,2,3 --mixed_precision no --multi_gpu --num_processes 4 --num_machines 1 --num_cpu_threads_per_process 2 /app/sd-scripts/sdxl_train_network.py --config_file /dataset/lora/ruanmei/model//config_lora-20240510-144542.toml
14:45:42-131952 INFO Command executed.
2024-05-10 14:45:48 INFO Loading settings from /dataset/lora/ruanmei/model//config_lora-20240510-144542.toml... train_util.py:3744
INFO /dataset/lora/ruanmei/model//config_lora-20240510-144542 train_util.py:3763
2024-05-10 14:45:48 INFO prepare tokenizers sdxl_train_util.py:134
2024-05-10 14:45:48 INFO Loading settings from /dataset/lora/ruanmei/model//config_lora-20240510-144542.toml... train_util.py:3744
2024-05-10 14:45:48 INFO Loading settings from /dataset/lora/ruanmei/model//config_lora-20240510-144542.toml... train_util.py:3744
INFO /dataset/lora/ruanmei/model//config_lora-20240510-144542 train_util.py:3763
INFO /dataset/lora/ruanmei/model//config_lora-20240510-144542 train_util.py:3763
2024-05-10 14:45:48 INFO prepare tokenizers sdxl_train_util.py:134
2024-05-10 14:45:48 INFO prepare tokenizers sdxl_train_util.py:134
2024-05-10 14:45:49 INFO Loading settings from /dataset/lora/ruanmei/model//config_lora-20240510-144542.toml... train_util.py:3744
INFO /dataset/lora/ruanmei/model//config_lora-20240510-144542 train_util.py:3763
2024-05-10 14:45:49 INFO prepare tokenizers sdxl_train_util.py:134
Traceback (most recent call last):
File "/home/1000/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 793, in urlopen
response = self._make_request(
File "/home/1000/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 537, in _make_request
response = conn.getresponse()
File "/home/1000/.local/lib/python3.10/site-packages/urllib3/connection.py", line 466, in getresponse
httplib_response = super().getresponse()
File "/usr/local/lib/python3.10/http/client.py", line 1375, in getresponse
response.begin()
File "/usr/local/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/usr/local/lib/python3.10/http/client.py", line 287, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
The above exception was the direct cause of the following exception:
urllib3.exceptions.ProxyError: ('Unable to connect to proxy', RemoteDisconnected('Remote end closed connection without response'))
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/1000/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/home/1000/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 847, in urlopen
retries = retries.increment(
File "/home/1000/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /openai/clip-vit-large-patch14/resolve/main/tokenizer_config.json (Caused by ProxyError('Unable to connect to proxy', RemoteDisconnected('Remote end closed connection without response')))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/sd-scripts/sdxl_train_network.py", line 185, in <module>
trainer.train(args)
File "/app/sd-scripts/train_network.py", line 154, in train
tokenizer = self.load_tokenizer(args)
File "/app/sd-scripts/sdxl_train_network.py", line 53, in load_tokenizer
tokenizer = sdxl_train_util.load_tokenizers(args)
File "/app/sd-scripts/library/sdxl_train_util.py", line 147, in load_tokenizers
tokenizer = CLIPTokenizer.from_pretrained(original_path)
File "/home/1000/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1969, in from_pretrained
resolved_config_file = cached_file(
File "/home/1000/.local/lib/python3.10/site-packages/transformers/utils/hub.py", line 398, in cached_file
resolved_file = hf_hub_download(
File "/home/1000/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/home/1000/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1238, in hf_hub_download
metadata = get_hf_file_metadata(
File "/home/1000/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/home/1000/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1631, in get_hf_file_metadata
r = _request_wrapper(
File "/home/1000/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 385, in _request_wrapper
response = _request_wrapper(
File "/home/1000/.local/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 408, in _request_wrapper
response = get_session().request(method=method, url=url, **params)
File "/home/1000/.local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/home/1000/.local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/home/1000/.local/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 67, in send
return super().send(request, *args, **kwargs)
File "/home/1000/.local/lib/python3.10/site-packages/requests/adapters.py", line 513, in send
raise ProxyError(e, request=request)
requests.exceptions.ProxyError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /openai/clip-vit-large-patch14/resolve/main/tokenizer_config.json (Caused by ProxyError('Unable to connect to proxy', RemoteDisconnected('Remote end closed connection without response')))"), '(Request ID: 1cd0ba3c-9d26-4b30-8881-e46f1ad80288)')
2024-05-10 14:45:49 INFO update token length: 225 sdxl_train_util.py:159
INFO Using DreamBooth method. train_network.py:172
2024-05-10 14:45:50 INFO prepare images. train_util.py:1572
INFO found directory /dataset/lora/ruanmei/img/3_ruan_mei_(honkai_star_rail) 1girl contains 230 image files train_util.py:1519
INFO found directory /dataset/lora/ruanmei/img/2_ruan_mei_(honkai_star_rail) 1girl contains 247 image files train_util.py:1519
INFO found directory /dataset/lora/ruanmei/img/5_ruan_mei_(honkai_star_rail) 1girl contains 85 image files train_util.py:1519
INFO found directory /dataset/lora/ruanmei/img/6_ruan_mei_(honkai_star_rail) 1girl contains 89 image files train_util.py:1519
INFO 2143 train images with repeating. train_util.py:1613
INFO 0 reg images. train_util.py:1616
WARNING no regularization images / 正則化画像が見つかりませんでした (no regularization images found) train_util.py:1621
[2024-05-10 14:45:50,157] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 704 closing signal SIGTERM
[2024-05-10 14:45:50,157] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 705 closing signal SIGTERM
[2024-05-10 14:45:50,157] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 706 closing signal SIGTERM
[2024-05-10 14:45:50,223] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 703) of binary: /usr/local/bin/python
Traceback (most recent call last):
File "/home/1000/.local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/1000/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/1000/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "/home/1000/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "/home/1000/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/1000/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/1000/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/app/sd-scripts/sdxl_train_network.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-10_14:45:50
host : 813dd376f19c
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 703)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
14:45:51-417026 INFO Training has ended.
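A note on the trace above: the rank that crashed first (local_rank 0) appears to have died while trying to download the CLIP tokenizer (openai/clip-vit-large-patch14) from huggingface.co through a proxy, so at least this particular failure may be a network/proxy problem inside my container rather than a multi-GPU problem per se. A sketch of a possible workaround, assuming the host has working internet access and that /dataset/hf_cache is just an example path the container can also see:

    # Pre-cache the tokenizer on the host (or any machine with network access):
    export HF_HOME=/dataset/hf_cache
    python -c "from transformers import CLIPTokenizer; CLIPTokenizer.from_pretrained('openai/clip-vit-large-patch14')"
    # Start the container pointing at the same cache so from_pretrained resolves locally:
    #   docker run ... -e HF_HOME=/dataset/hf_cache ...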
The following is from the 6x RTX 3090 machine, in Docker. Since the full log is too long and exceeds the character limit, I am pasting only the final error message.
accelerator device: cuda:5
2024-05-10 15:02:59 INFO U-Net: <All keys matched successfully> sdxl_model_util.py:202
INFO building text encoders sdxl_model_util.py:205
2024-05-10 15:03:00 INFO loading text encoders from checkpoint sdxl_model_util.py:258
INFO text encoder 1: <All keys matched successfully> sdxl_model_util.py:272
2024-05-10 15:03:02 INFO text encoder 2: <All keys matched successfully> sdxl_model_util.py:276
INFO building VAE sdxl_model_util.py:279
2024-05-10 15:03:03 INFO loading VAE from checkpoint sdxl_model_util.py:284
INFO VAE: <All keys matched successfully> sdxl_model_util.py:287
INFO load VAE: /dataset/vae/sdxl_vae.safetensors model_util.py:1268
2024-05-10 15:03:04 INFO additional VAE loaded sdxl_train_util.py:128
[2024-05-10 15:03:11,271] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 675 closing signal SIGTERM
[2024-05-10 15:03:11,435] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 0 (pid: 672) of binary: /usr/local/bin/python
Traceback (most recent call last):
File "/home/1000/.local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/1000/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/1000/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "/home/1000/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "/home/1000/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/1000/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/1000/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
===================================================
/app/sd-scripts/sdxl_train_network.py FAILED
---------------------------------------------------
Failures:
[1]:
time : 2024-05-10_15:03:11
host : 507b354d3cab
rank : 1 (local_rank: 1)
exitcode : -7 (pid: 673)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 673
[2]:
time : 2024-05-10_15:03:11
host : 507b354d3cab
rank : 2 (local_rank: 2)
exitcode : -7 (pid: 674)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 674
[3]:
time : 2024-05-10_15:03:11
host : 507b354d3cab
rank : 4 (local_rank: 4)
exitcode : -7 (pid: 676)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 676
[4]:
time : 2024-05-10_15:03:11
host : 507b354d3cab
rank : 5 (local_rank: 5)
exitcode : -7 (pid: 677)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 677
---------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-10_15:03:11
host : 507b354d3cab
rank : 0 (local_rank: 0)
exitcode : -7 (pid: 672)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 672
===================================================
15:03:13-135365 INFO Training has ended.
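A note on this second log: exit code -7 means the workers received SIGBUS. Inside Docker that is commonly caused by the default 64 MB /dev/shm being too small for multi-process PyTorch training; I have not confirmed that this is the cause here, but it is cheap to rule out by giving the container more shared memory when it is started:

    # Either enlarge the container's shared-memory segment...
    docker run --shm-size=16g ...
    # ...or share the host's IPC namespace instead:
    docker run --ipc=host ...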
Additional info: Previously, before accelerate launch was introduced to the GUI, multi-GPU worked perfectly: all you needed to do was configure accelerate and it would run smoothly. So I thought it might be a good idea to bypass the accelerate launch options, for example by leaving the multi-GPU checkbox unchecked. But I was wrong: it either runs on a single GPU or just errors out. To check whether this is an isolated case, I tried different machines and operating systems, but the error message is always very similar: it is always a "torch.distributed.elastic.multiprocessing.errors.ChildFailedError"; the only difference is that "exitcode : 1 (pid: 703)" may show other numbers.
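One way to take the GUI out of the loop while debugging is to copy the exact command it prints on the "Executing command:" line and run it directly from a shell; for example, the command from my 4x A100 log above was:

    /home/1000/.local/bin/accelerate launch --dynamo_backend no --dynamo_mode default \
        --gpu_ids 0,1,2,3 --mixed_precision no --multi_gpu --num_processes 4 \
        --num_machines 1 --num_cpu_threads_per_process 2 \
        /app/sd-scripts/sdxl_train_network.py \
        --config_file /dataset/lora/ruanmei/model//config_lora-20240510-144542.toml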
Please help.
I am not sure if this is the same issue or a different one, but I am also hitting errors and cannot proceed when trying multi-GPU SDXL fine-tuning on a similar Linux machine.
caching latents...
42%|████████████████████████████████████████████████████████████████████▎ | 9588/23008 [27:23<32:18, 6.92it/s]
[2024-05-19 06:17:40,266] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1242468 closing signal SIGTERM
[2024-05-19 06:17:40,323] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1242469 closing signal SIGTERM
[2024-05-19 06:17:40,327] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1242471 closing signal SIGTERM
[2024-05-19 06:17:42,066] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 2 (pid: 1242470) of binary: /home/ubuntu/train/kohya_ss_22.4.1/venv/bin/python
Traceback (most recent call last):
File "/home/ubuntu/train/kohya_ss_22.4.1/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/train/kohya_ss_22.4.1/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/ubuntu/train/kohya_ss_22.4.1/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command
multi_gpu_launcher(args)
File "/home/ubuntu/train/kohya_ss_22.4.1/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/home/ubuntu/train/kohya_ss_22.4.1/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/ubuntu/train/kohya_ss_22.4.1/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/train/kohya_ss_22.4.1/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
./sdxl_train.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-19_06:17:40
host : ubuntutrain
rank : 2 (local_rank: 2)
exitcode : -9 (pid: 1242470)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 1242470
========================================================
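In my case the exit code is -9, i.e. the process was SIGKILLed part-way through "caching latents...", which usually points at the kernel OOM killer running out of system RAM while several processes cache latents at once; that is an assumption I have not yet verified on this box. A quick check, plus the mitigation I intend to try:

    # See whether the kernel OOM killer terminated the trainer:
    dmesg | grep -i -E 'killed process|out of memory'
    # If it did, caching latents to disk instead of RAM should shrink the footprint
    # (sd-scripts option, added to the training command or config):
    #   --cache_latents_to_disk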