sd-scripts
New training broken on Kaggle due to DistributedDataParallel and torch.distributed.elastic.multiprocessing.api
I am trying to do multi-GPU training on Kaggle.
Previously it was working great, but after all these new changes I am getting the error below:
Traceback (most recent call last):
File "/kaggle/working/kohya_ss/sd-scripts/train_db.py", line 529, in <module>
train(args)
File "/kaggle/working/kohya_ss/sd-scripts/train_db.py", line 343, in train
encoder_hidden_states = train_util.get_hidden_states(
File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 4427, in get_hidden_states
encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'
File "/kaggle/working/kohya_ss/sd-scripts/train_db.py", line 529, in <module>
train(args)
File "/kaggle/working/kohya_ss/sd-scripts/train_db.py", line 343, in train
encoder_hidden_states = train_util.get_hidden_states(
File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 4427, in get_hidden_states
encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'
steps: 0%| | 0/3000 [00:00<?, ?it/s]
[2024-04-18 00:21:49,711] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1114) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
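The AttributeError itself comes from how PyTorch's DistributedDataParallel wrapper behaves: the original model is stored on the wrapper as `.module`, and attribute lookups such as `text_encoder.text_model` are not forwarded, so they fail once accelerate wraps the text encoder for multi-GPU training. A minimal stand-alone reproduction, using a stand-in module rather than sd-scripts code and a single-process gloo group just to construct the wrapper:

```python
import os

import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "gloo" group, only so a DDP wrapper can be constructed on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

class TextEncoderStub(nn.Module):  # stand-in module, not the real CLIP text encoder
    def __init__(self):
        super().__init__()
        self.text_model = nn.Linear(4, 4)

wrapped = DDP(TextEncoderStub())

try:
    wrapped.text_model  # nn.Module.__getattr__ only sees the wrapper's own attributes
except AttributeError as e:
    print(e)  # 'DistributedDataParallel' object has no attribute 'text_model'

print(wrapped.module.text_model)  # the original module is kept at .module

dist.destroy_process_group()
```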
The train command is like this:
Executing command: "/opt/conda/bin/accelerate" launch
--dynamo_backend no --dynamo_mode default --gpu_ids 0,1
--mixed_precision no --multi_gpu --num_processes 2
--num_machines 1 --num_cpu_threads_per_process 4
"/kaggle/working/kohya_ss/sd-scripts/train_db.py"
--config_file "./outputs/tmpfiledbooth.toml"
--max_grad_norm=0.0 --no_half_vae
--ddp_timeout=10000000 --ddp_gradient_as_bucket_view
Even single-GPU training fails on Kaggle now.
[2024-04-18 00:29:47,958] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1187 closing signal SIGTERM
[2024-04-18 00:29:48,123] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 1188) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
multi_gpu_launcher(args)
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
distrib_run.run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/kaggle/working/kohya_ss/sd-scripts/train_db.py FAILED
DDP training for fine_tune.py or train_db.py (SD1.5/2.0) with clip_skip>=2 seems to cause this issue. Could you try without clip_skip?
If it works without clip_skip, it is caused by accessing the inner layers of the model directly on the model wrapped by accelerator. It may need some investigation to solve the issue...
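A minimal sketch of the unwrap-before-access pattern described above, assuming accelerate's `Accelerator.unwrap_model`; `get_hidden_states_safe`, `TextModelStub`, and `TextEncoderStub` are hypothetical stand-ins, not the actual sd-scripts fix:

```python
import torch
from accelerate import Accelerator
from torch import nn

# Hypothetical stand-ins; the real fix in sd-scripts may look different.
class TextModelStub(nn.Module):
    def __init__(self):
        super().__init__()
        self.final_layer_norm = nn.LayerNorm(8)

class TextEncoderStub(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_model = TextModelStub()

def get_hidden_states_safe(text_encoder, encoder_hidden_states, accelerator):
    # Unwrap before touching inner attributes such as .text_model;
    # unwrap_model simply returns the model when it was never wrapped.
    unwrapped = accelerator.unwrap_model(text_encoder)
    return unwrapped.text_model.final_layer_norm(encoder_hidden_states)

accelerator = Accelerator()
text_encoder = accelerator.prepare(TextEncoderStub())  # DDP-wrapped under --multi_gpu
hidden = torch.zeros(2, 77, 8, device=accelerator.device)
print(get_hidden_states_safe(text_encoder, hidden, accelerator).shape)
```

Because unwrap_model returns the model unchanged when it was never wrapped, the same code path can serve both single-GPU and multi-GPU runs.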
Possibly a duplicate of #1099.
I didn't set clip_skip; I use the default value. After I selected a single P100 GPU it worked, but with dual T4 GPUs it always failed.
Yes, my config has "clip_skip": 1.
I train only text encoder 1, not text encoder 2.
I had the same problem as you. May I ask how you eventually solved it?
No, I didn't.
I used only a single GPU to work around the issue.
Before these changes it was working perfectly.
After this topic I didn't try again either.
Thank you for your reply; I will continue to look for a solution.