
How do I deal with a workstation with 4× RTX 4090 that cannot start multi-GPU training? I modified part of the code based on earlier Issues, but still cannot get all four cards working.

Open utenkkekou opened this issue 1 year ago • 3 comments

```
Traceback (most recent call last):
  File "/home/ubuntu/code/lora-scripts/./sd-scripts/train_db.py", line 529, in <module>
    train(args)
  File "/home/ubuntu/code/lora-scripts/./sd-scripts/train_db.py", line 343, in train
    encoder_hidden_states = train_util.get_hidden_states(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/code/lora-scripts/sd-scripts/library/train_util.py", line 4428, in get_hidden_states
    encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
                            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/code/lora-scripts/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'
```

The same traceback is printed by each of the four ranks; the launcher then aborts:

```
steps:   0%|          | 0/400384 [00:01<?, ?it/s]
[2024-07-01 17:02:53,653] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1151723) of binary: /home/ubuntu/code/lora-scripts/venv/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/ubuntu/code/lora-scripts/venv/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1027, in <module>
    main()
  File "/home/ubuntu/code/lora-scripts/venv/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1023, in main
    launch_command(args)
  File "/home/ubuntu/code/lora-scripts/venv/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/home/ubuntu/code/lora-scripts/venv/lib/python3.11/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/ubuntu/code/lora-scripts/venv/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/ubuntu/code/lora-scripts/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/code/lora-scripts/venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./sd-scripts/train_db.py FAILED

Failures:
  [1]:
    time       : 2024-07-01_17:02:53
    host       : ubun
    rank       : 1 (local_rank: 1)
    exitcode   : 1 (pid: 1151724)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
  [2]:
    time       : 2024-07-01_17:02:53
    host       : ubun
    rank       : 2 (local_rank: 2)
    exitcode   : 1 (pid: 1151725)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
  [3]:
    time       : 2024-07-01_17:02:53
    host       : ubun
    rank       : 3 (local_rank: 3)
    exitcode   : 1 (pid: 1151727)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
  [0]:
    time       : 2024-07-01_17:02:53
    host       : ubun
    rank       : 0 (local_rank: 0)
    exitcode   : 1 (pid: 1151723)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

17:02:54-224225 ERROR    Training failed / 训练失败
```

The error output is shown above.

utenkkekou avatar Jul 01 '24 09:07 utenkkekou
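Note: the traceback indicates that `get_hidden_states` receives a text encoder that is still wrapped in `DistributedDataParallel`, so `text_encoder.text_model` fails; the underlying CLIP model has to be reached through the wrapper's `.module` attribute. Below is a minimal sketch of that unwrapping, assuming this call site is the only place affected; the helper name `unwrap_text_encoder` is hypothetical and not part of sd-scripts.

```python
# Minimal sketch: unwrap a possibly-DDP-wrapped text encoder before touching
# attributes of the underlying model (e.g. text_model.final_layer_norm).
# The helper name `unwrap_text_encoder` is hypothetical, not part of sd-scripts.
from torch.nn.parallel import DistributedDataParallel


def unwrap_text_encoder(text_encoder):
    # DDP exposes the wrapped model via `.module`; plain models pass through unchanged.
    if isinstance(text_encoder, DistributedDataParallel):
        return text_encoder.module
    return text_encoder


# Assumed usage inside library/train_util.py::get_hidden_states:
# text_encoder = unwrap_text_encoder(text_encoder)
# encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
```

Where an `Accelerator` instance is in scope, `accelerator.unwrap_model(text_encoder)` should serve the same purpose.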

Same problem here.

changqingla avatar Oct 10 '24 03:10 changqingla

Same here, I cannot enable multi-GPU training and don't know why. Is there anyone in the community who can solve this?

lidisi8520 avatar Jan 13 '25 05:01 lidisi8520

same

liend123 avatar May 08 '25 03:05 liend123