
fix(finetune_ds.sh): `zero3` conflict with `device_map`

Open · tpoisonooo opened this issue · 2 comments

Every huggingface/transformers release from v4.30.0 up to the latest contains this block:

        if device_map is not None:
            if low_cpu_mem_usage is None:
                low_cpu_mem_usage = True
..
        if low_cpu_mem_usage:
..
            if is_deepspeed_zero3_enabled():
                raise ValueError(
                    "DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`."
                )

https://github.com/huggingface/transformers/blob/a7cab3c283312b8d4de5df3bbe719971e24f4281/src/transformers/modeling_utils.py#L2847C1-L2861C18

As a result, finetune crashes when loading the model via model = transformers.AutoModelForCausalLM.from_pretrained, reporting that low_cpu_mem_usage conflicts with DeepSpeed ZeRO-3.

│ /root/miniconda3/envs/torch2/lib/python3.10/site-packages/transformers/modeling_utils.py:2364 in │
│ from_pretrained                                                                                  │
 ..
│   2363 │   │   │   if is_deepspeed_zero3_enabled():                                              │
│ ❱ 2364 │   │   │   │   raise ValueError(                                                         │
│   2365 │   │   │   │   │   "DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or  │
│   2366 │   │   │   │   )                                                                         │
│   2367 │   │   │   elif not is_accelerate_available():                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
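
A minimal repro sketch of how the check gets tripped (my own, with a placeholder model id and a hypothetical config path, not code from finetune.py): the DeepSpeed-enabled TrainingArguments are presumably parsed before the model is loaded, so the ZeRO-3 state is already registered by the time from_pretrained sees a device_map.

    # Repro sketch: placeholder model id and hypothetical config path.
    import transformers
    from transformers import TrainingArguments

    # Building TrainingArguments with a ZeRO-3 JSON registers the DeepSpeed
    # config globally; is_deepspeed_zero3_enabled() inside from_pretrained
    # reads exactly this state.
    training_args = TrainingArguments(
        output_dir="output_qwen",
        deepspeed="finetune/ds_config_zero3.json",  # hypothetical path
    )

    # With ZeRO-3 active, a device_map silently flips low_cpu_mem_usage to True,
    # and from_pretrained raises the ValueError shown in the traceback above.
    model = transformers.AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen-7B",  # placeholder checkpoint
        device_map="auto",
        trust_remote_code=True,
    )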

Stuck on this for now, sadly.

tpoisonooo · Dec 20 '23 09:12

Or should low_cpu_mem_usage be explicitly set to False?
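
A sketch of that suggestion (mine, with a placeholder model id). Note that, as far as I can tell, the same transformers versions also reject a device_map combined with low_cpu_mem_usage=False, so the device_map has to be dropped as well:

    # Workaround sketch for ZeRO-3: no device_map, low_cpu_mem_usage pinned
    # to False so from_pretrained never auto-enables it.
    import transformers

    model = transformers.AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen-7B",           # placeholder checkpoint
        low_cpu_mem_usage=False,  # device_map intentionally omitted
        trust_remote_code=True,
    )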

tpoisonooo · Dec 20 '23 09:12

It still errors out when using finetune/ds_config_zero2.json.

root@gpu-3:/pwd# bash finetune_ds.sh 
[2023-12-26 01:51:54,726] torch.distributed.run: [WARNING] 
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[2023-12-26 01:51:57,687] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-26 01:51:58,569] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-26 01:52:00,355] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-26 01:52:00,366] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-26 01:52:00,727] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-26 01:52:00,746] [INFO] [comm.py:637:init_distributed] cdb=None
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
[2023-12-26 01:52:01,949] [INFO] [comm.py:637:init_distributed] cdb=None
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
[2023-12-26 01:52:02,070] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-26 01:52:02,070] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Loading checkpoint shards:   0%|                                          | 0/15 [00:00<?, ?it/s]The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Loading checkpoint shards: 100%|█████████████████████████████████| 15/15 [00:10<00:00,  1.44it/s]
Loading checkpoint shards: 100%|█████████████████████████████████| 15/15 [00:10<00:00,  1.44it/s]
Loading checkpoint shards:  93%|██████████████████████████████▊  | 14/15 [00:09<00:00,  1.37it/s]Traceback (most recent call last):
  File "/pwd/finetune.py", line 360, in <module>
    train()
  File "/pwd/finetune.py", line 353, in train
Loading checkpoint shards:  93%|██████████████████████████████▊  | 14/15 [00:09<00:00,  1.40it/s]    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1687, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1210, in prepare
    raise ValueError(
ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`.
Traceback (most recent call last):
  File "/pwd/finetune.py", line 360, in <module>
    train()
  File "/pwd/finetune.py", line 353, in train
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1687, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1210, in prepare
    raise ValueError(
ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`.
Loading checkpoint shards: 100%|█████████████████████████████████| 15/15 [00:10<00:00,  1.47it/s]
Loading checkpoint shards: 100%|█████████████████████████████████| 15/15 [00:10<00:00,  1.48it/s]
Loading data...
Formatting inputs...Skip in lazy mode
Traceback (most recent call last):
  File "/pwd/finetune.py", line 360, in <module>
    train()
  File "/pwd/finetune.py", line 353, in train
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1687, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1210, in prepare
    raise ValueError(
ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`.
Traceback (most recent call last):
  File "/pwd/finetune.py", line 360, in <module>
    train()
  File "/pwd/finetune.py", line 353, in train
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1687, in _inner_training_loop
    model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1210, in prepare
    raise ValueError(
ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`.
[2023-12-26 01:52:14,748] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 949 closing signal SIGTERM
[2023-12-26 01:52:14,748] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 952 closing signal SIGTERM
[2023-12-26 01:52:15,062] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 950) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+b5021ba', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-12-26_01:52:14
  host      : gpu-3
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 951)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-12-26_01:52:14
  host      : gpu-3
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 950)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

LucienShui · Dec 25 '23 17:12

(Quoting the previous comment: it still errors out when using finetune/ds_config_zero2.json, with the same log as above.)

That is a different error now: "You can't train a model that has been loaded with `device_map='auto'`".
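
For reference, this second error comes from Accelerate rather than from the ZeRO-3 check: device_map="auto" is rejected whenever the script runs under a distributed launcher such as torchrun. A hedged sketch of one way around it (gating on WORLD_SIZE is my assumption, not necessarily how the official finetune.py handles it):

    # Sketch: only use device_map="auto" for a single-process launch; under
    # torchrun, WORLD_SIZE > 1 and Accelerate/DeepSpeed place the model itself.
    import os
    import transformers

    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    device_map = "auto" if world_size == 1 else None

    model = transformers.AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen-7B",  # placeholder checkpoint
        device_map=device_map,
        trust_remote_code=True,
    )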

tpoisonooo · Jan 08 '24 03:01

So how should this issue be resolved?

SuooL · Jan 08 '24 08:01

Hi @tpoisonooo, thank you for this PR. The device_map problem in finetune.py should be fixed on main (it shouldn't be "auto" anyway), and low_cpu_mem_usage is set to True only when the proper conditions are met. I'm going to close this PR for now. Thanks again for the support!
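
For anyone landing here later, a rough sketch of the conditional loading described above (my interpretation, not the actual patch on main): never pass device_map="auto" under a distributed launch or ZeRO-3, and only enable low_cpu_mem_usage when a device_map is actually used.

    # Rough sketch of the described fix (my interpretation, not the real patch).
    import os
    import transformers

    try:
        from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled
    except ImportError:  # older releases expose it under transformers.deepspeed
        from transformers.deepspeed import is_deepspeed_zero3_enabled


    def load_model(model_path: str):
        """Load a causal LM, adding device_map/low_cpu_mem_usage only when safe."""
        # is_deepspeed_zero3_enabled() only reflects reality after the DeepSpeed
        # TrainingArguments have been constructed, so call this afterwards.
        distributed = int(os.environ.get("WORLD_SIZE", "1")) > 1
        kwargs = {"trust_remote_code": True}
        if not distributed and not is_deepspeed_zero3_enabled():
            # Single process, no ZeRO-3: device placement and lazy init are fine.
            kwargs.update(device_map="auto", low_cpu_mem_usage=True)
        return transformers.AutoModelForCausalLM.from_pretrained(model_path, **kwargs)


    model = load_model("Qwen/Qwen-7B")  # placeholder checkpoint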

jklj077 · Jan 11 '24 14:01