fix(finetune_ds.sh): `zero3` conflict with `device_map`
huggingface/transformers, from v4.30.0 up to the latest release, contains this snippet:
if device_map is not None:
    if low_cpu_mem_usage is None:
        low_cpu_mem_usage = True
    ..
if low_cpu_mem_usage:
    ..
    if is_deepspeed_zero3_enabled():
        raise ValueError(
            "DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`."
        )
https://github.com/huggingface/transformers/blob/a7cab3c283312b8d4de5df3bbe719971e24f4281/src/transformers/modeling_utils.py#L2847C1-L2861C18
As a result, when finetune loads the model via `model = transformers.AutoModelForCausalLM.from_pretrained(...)`, it crashes, reporting that `low_cpu_mem_usage` conflicts with DeepSpeed ZeRO-3.
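For illustration, a minimal sketch of a ZeRO-3-aware load on the caller side (the helper name `load_model` and its arguments are my own, not the repo's code; the import path of `is_deepspeed_zero3_enabled` varies by transformers version): only pass `device_map` when ZeRO-3 is not enabled, so `low_cpu_mem_usage` never gets forced to `True`.

```python
import transformers
# Assumption: recent transformers re-exports this from transformers.integrations;
# older releases expose it as transformers.deepspeed.is_deepspeed_zero3_enabled.
from transformers.integrations import is_deepspeed_zero3_enabled


def load_model(model_name_or_path: str, device_map=None):
    # Under ZeRO-3, DeepSpeed shards and places parameters itself, so neither
    # device_map nor low_cpu_mem_usage=True should reach from_pretrained.
    extra_kwargs = {}
    if device_map is not None and not is_deepspeed_zero3_enabled():
        extra_kwargs["device_map"] = device_map  # e.g. "auto" for single-process use

    return transformers.AutoModelForCausalLM.from_pretrained(
        model_name_or_path,
        trust_remote_code=True,  # Qwen checkpoints need remote code
        **extra_kwargs,
    )
```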
│ /root/miniconda3/envs/torch2/lib/python3.10/site-packages/transformers/modeling_utils.py:2364 in │
│ from_pretrained │
..
│ 2363 │ │ │ if is_deepspeed_zero3_enabled(): │
│ ❱ 2364 │ │ │ │ raise ValueError( │
│ 2365 │ │ │ │ │ "DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or │
│ 2366 │ │ │ │ ) │
│ 2367 │ │ │ elif not is_accelerate_available(): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
Frustrating, qaq. Or should `low_cpu_mem_usage` be explicitly set to `False`?
Using finetune/ds_config_zero2.json still fails with an error:
root@gpu-3:/pwd# bash finetune_ds.sh
[2023-12-26 01:51:54,726] torch.distributed.run: [WARNING]
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[2023-12-26 01:51:57,687] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-26 01:51:58,569] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-26 01:52:00,355] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-26 01:52:00,366] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-26 01:52:00,727] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-26 01:52:00,746] [INFO] [comm.py:637:init_distributed] cdb=None
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
[2023-12-26 01:52:01,949] [INFO] [comm.py:637:init_distributed] cdb=None
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
[2023-12-26 01:52:02,070] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-26 01:52:02,070] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Loading checkpoint shards: 0%| | 0/15 [00:00<?, ?it/s]The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Loading checkpoint shards: 100%|█████████████████████████████████| 15/15 [00:10<00:00, 1.44it/s]
Loading checkpoint shards: 100%|█████████████████████████████████| 15/15 [00:10<00:00, 1.44it/s]
Loading checkpoint shards: 93%|██████████████████████████████▊ | 14/15 [00:09<00:00, 1.37it/s]Traceback (most recent call last):
File "/pwd/finetune.py", line 360, in <module>
train()
File "/pwd/finetune.py", line 353, in train
Loading checkpoint shards: 93%|██████████████████████████████▊ | 14/15 [00:09<00:00, 1.40it/s] trainer.train()
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1687, in _inner_training_loop
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1210, in prepare
raise ValueError(
ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`.
Traceback (most recent call last):
File "/pwd/finetune.py", line 360, in <module>
train()
File "/pwd/finetune.py", line 353, in train
trainer.train()
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1687, in _inner_training_loop
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1210, in prepare
raise ValueError(
ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`.
Loading checkpoint shards: 100%|█████████████████████████████████| 15/15 [00:10<00:00, 1.47it/s]
Loading checkpoint shards: 100%|█████████████████████████████████| 15/15 [00:10<00:00, 1.48it/s]
Loading data...
Formatting inputs...Skip in lazy mode
Traceback (most recent call last):
File "/pwd/finetune.py", line 360, in <module>
train()
File "/pwd/finetune.py", line 353, in train
trainer.train()
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1687, in _inner_training_loop
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1210, in prepare
raise ValueError(
ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`.
Traceback (most recent call last):
File "/pwd/finetune.py", line 360, in <module>
train()
File "/pwd/finetune.py", line 353, in train
trainer.train()
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1687, in _inner_training_loop
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1210, in prepare
raise ValueError(
ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`.
[2023-12-26 01:52:14,748] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 949 closing signal SIGTERM
[2023-12-26 01:52:14,748] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 952 closing signal SIGTERM
[2023-12-26 01:52:15,062] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 950) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.1.0a0+b5021ba', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-12-26_01:52:14
host : gpu-3
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 951)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-12-26_01:52:14
host : gpu-3
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 950)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
That is a new error now: "You can't train a model that has been loaded with `device_map='auto'`".
So how do we solve this?
Hi, @tpoisonooo, thank you for this PR. The `device_map` problem in `finetune.py` should be fixed in `main` (it shouldn't be `"auto"` anyway), and `low_cpu_mem_usage` is set to `True` if proper conditions are met. I'm going to close this PR for now. Thanks again for the support!
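For reference, a hedged sketch in the spirit of that fix (the `resolve_device_map` helper, the `use_qlora` flag, and the placeholder checkpoint name are illustrative assumptions, not the repo's actual code): leave `device_map` unset for distributed DeepSpeed training, and only pin the model to a device in the quantized-LoRA / single-process case.

```python
import os
import transformers


def resolve_device_map(use_qlora: bool = False):
    # torchrun/DeepSpeed set WORLD_SIZE and LOCAL_RANK for each worker process.
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    is_distributed = world_size > 1
    if use_qlora:
        # Quantized LoRA needs the whole model on one device per process.
        return {"": int(os.environ.get("LOCAL_RANK", "0"))} if is_distributed else "auto"
    # Plain DeepSpeed (ZeRO-2/ZeRO-3) fine-tuning: let DeepSpeed own placement.
    return None


model = transformers.AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",             # placeholder checkpoint
    device_map=resolve_device_map(), # None here, so no conflict with torchrun/DeepSpeed
    trust_remote_code=True,
)
```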