
New training broken on Kaggle due to DistributedDataParallel and torch.distributed.elastic.multiprocessing.api

Open FurkanGozukara opened this issue 10 months ago • 7 comments

I am trying to do multi-GPU training on Kaggle.

Previously it was working great.

But after all these new changes I am getting the error below:

Traceback (most recent call last):
  File "/kaggle/working/kohya_ss/sd-scripts/train_db.py", line 529, in <module>
    train(args)
  File "/kaggle/working/kohya_ss/sd-scripts/train_db.py", line 343, in train
    encoder_hidden_states = train_util.get_hidden_states(
  File "/kaggle/working/kohya_ss/sd-scripts/library/train_util.py", line 4427, in get_hidden_states
    encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'
steps:   0%|                                           | 0/3000 [00:00<?, ?it/s]
[2024-04-18 00:21:49,711] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1114) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

My train command is like this:

 Executing command: "/opt/conda/bin/accelerate" launch  
                         --dynamo_backend no --dynamo_mode default --gpu_ids 0,1
                         --mixed_precision no --multi_gpu --num_processes 2     
                         --num_machines 1 --num_cpu_threads_per_process 4       
                         "/kaggle/working/kohya_ss/sd-scripts/train_db.py"      
                         --config_file "./outputs/tmpfiledbooth.toml"           
                         --max_grad_norm=0.0 --no_half_vae                      
                         --ddp_timeout=10000000 --ddp_gradient_as_bucket_view 
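
For reference, the error above can be reproduced outside of sd-scripts: with --multi_gpu, accelerator.prepare() wraps the text encoder in DistributedDataParallel, which only exposes the inner model through its .module attribute. A minimal, self-contained sketch of that behavior (the TextModel/TextEncoder classes are hypothetical stand-ins for the CLIP text encoder, and a single-process gloo group is used so it runs on CPU):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process "gloo" group so DDP can be constructed without a real multi-GPU launch.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

class TextModel(torch.nn.Module):  # hypothetical stand-in for CLIPTextTransformer
    def __init__(self):
        super().__init__()
        self.final_layer_norm = torch.nn.LayerNorm(8)
    def forward(self, x):
        return self.final_layer_norm(x)

class TextEncoder(torch.nn.Module):  # hypothetical stand-in for CLIPTextModel
    def __init__(self):
        super().__init__()
        self.text_model = TextModel()
    def forward(self, x):
        return self.text_model(x)

# Roughly what accelerator.prepare() does to the text encoder under --multi_gpu.
text_encoder = DDP(TextEncoder())

try:
    text_encoder.text_model  # direct inner-layer access, as in get_hidden_states
except AttributeError as e:
    print(e)  # 'DistributedDataParallel' object has no attribute 'text_model'

# The wrapped model is still reachable through the .module attribute.
print(text_encoder.module.text_model.final_layer_norm)

dist.destroy_process_group()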

FurkanGozukara avatar Apr 18 '24 00:04 FurkanGozukara

Even single-GPU training fails on Kaggle now.

[2024-04-18 00:29:47,958] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1187 closing signal SIGTERM
[2024-04-18 00:29:48,123] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 1188) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/kaggle/working/kohya_ss/sd-scripts/train_db.py FAILED

FurkanGozukara avatar Apr 18 '24 00:04 FurkanGozukara

DDP training with fine_tune.py or train_db.py (SD1.5/2.0) and clip_skip>=2 seems to cause this issue. Could you try without clip_skip?

If it works without clip_skip, the issue is caused by directly accessing the inner layers of the model that has been wrapped by the accelerator. It may need some investigation to solve the issue...
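
For illustration, a minimal sketch of the kind of unwrapping that avoids this error when clip_skip >= 2. The helper name is hypothetical and this is not necessarily the actual fix in sd-scripts; accelerator, text_encoder, input_ids, and clip_skip follow the naming used in library/train_util.py.

def get_hidden_states_unwrapped(accelerator, text_encoder, input_ids, clip_skip=None):
    # accelerator.prepare() may have wrapped text_encoder in DistributedDataParallel;
    # Accelerator.unwrap_model() returns the underlying CLIPTextModel either way.
    unwrapped = accelerator.unwrap_model(text_encoder)

    if clip_skip is None or clip_skip <= 1:
        # No inner-layer access needed; call through the wrapper as usual.
        return text_encoder(input_ids)[0]

    # clip_skip >= 2: take an earlier hidden state, then apply the final layer norm.
    # This is the inner-layer access that fails on the DDP wrapper without unwrapping.
    enc_out = text_encoder(input_ids, output_hidden_states=True, return_dict=True)
    hidden_states = enc_out["hidden_states"][-clip_skip]
    return unwrapped.text_model.final_layer_norm(hidden_states)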

kohya-ss avatar Apr 21 '24 11:04 kohya-ss

possibly duplicate of #1099

kohya-ss avatar Apr 21 '24 12:04 kohya-ss

DDP training with fine_tune.py or train_db.py (SD1.5/2.0) and clip_skip>=2 seems to cause this issue. Could you try without clip_skip?

If it works without clip_skip, the issue is caused by directly accessing the inner layers of the model that has been wrapped by the accelerator. It may need some investigation to solve the issue...

I didn't set clip skip; I use the default value. After I selected a single P100 GPU it worked, but with dual T4 GPUs it always failed.

Yes, my config has "clip_skip": 1,

I train only text encoder 1 and not text encoder 2.

FurkanGozukara avatar Apr 21 '24 12:04 FurkanGozukara

I had the same problem as you. May I ask how you eventually solved it?

Nice-Zhang66 avatar Jul 01 '24 07:07 Nice-Zhang66

I had the same problem as you. May I ask how you eventually solved it?

No, I didn't.

I used only a single GPU to work around the issue.

Before these changes it was working perfectly.

After opening this topic I haven't tried again either.

FurkanGozukara avatar Jul 01 '24 08:07 FurkanGozukara

Thank you for your reply. I will continue to look for a solution.

Nice-Zhang66 avatar Jul 02 '24 09:07 Nice-Zhang66