CogVLM

Error when merging a LoRA model

Open • HarrytheOrange opened this issue 10 months ago • 1 comment

System Info

4x A800

Who can help?

@1049451037

Information

  • [X] The official example scripts
  • [ ] My own modified scripts and tasks

Reproduction

torchrun --standalone --nnodes=1 --nproc-per-node=4 utils/merge_model.py --version base --from_pretrained /mnt/cache/huangzhiyuan/thudm/CogVLM-photograph/checkpoints/finetune-cogvlm-base-490-04-10-12-50

Traceback (most recent call last):
  File "/mnt/cache/huangzhiyuan/thudm/CogVLM-photograph/utils/merge_model.py", line 42, in <module>
    main()
  File "/mnt/cache/huangzhiyuan/thudm/CogVLM-photograph/utils/merge_model.py", line 23, in main
    model, model_args = FineTuneTestCogVLMModel.from_pretrained(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/sat/model/base_model.py", line 257, in from_pretrained
    mp_merge_model_rank0(model, model_full)
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/sat/mpu/operation.py", line 112, in mp_merge_model_rank0
    iter_merge(model, model_full)
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/sat/mpu/operation.py", line 111, in iter_merge
    iter_merge(sub_new_model, sub_module)
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/sat/mpu/operation.py", line 111, in iter_merge
    iter_merge(sub_new_model, sub_module)
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/sat/mpu/operation.py", line 111, in iter_merge
    iter_merge(sub_new_model, sub_module)
  [Previous line repeated 5 more times]
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/sat/mpu/operation.py", line 110, in iter_merge
    p.data.copy_(torch.clone(np.data.cpu()).detach())
RuntimeError: The size of tensor a (1792) must match the size of tensor b (448) at non-singleton dimension 0
[2024-04-10 22:28:49,072] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1042137 closing signal SIGTERM
[2024-04-10 22:28:49,072] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1042138 closing signal SIGTERM
[2024-04-10 22:28:49,073] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1042139 closing signal SIGTERM
[2024-04-10 22:28:49,964] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1042136) of binary: /mnt/cache/huangzhiyuan/env/thudm/bin/python
Traceback (most recent call last):
  File "/mnt/cache/huangzhiyuan/env/thudm/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.0', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/cache/huangzhiyuan/env/thudm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

utils/merge_model.py FAILED

Failures: <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2024-04-10_22:28:49
  host      : pt-ryutnbhj-worker-0.pt-ryutnbhj.ns-operations-a5acdc67.svc.cluster.local
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1042136)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
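
A note on the numbers in the RuntimeError: the two sizes differ by a factor that can be compared against the number of processes passed to torchrun (`--nproc-per-node=4`). The sketch below only parses the message and computes that ratio; the helper name `shape_ratio` is made up for illustration and is not part of sat or CogVLM. A ratio equal to the process count may hint at a mismatch between how the checkpoint is sharded and how the merge is being run, but that is an observation, not a confirmed diagnosis.

```python
import re


def shape_ratio(error_line: str):
    """Parse 'tensor a (X) ... tensor b (Y)' from a PyTorch size-mismatch
    message and return (X, Y, X / Y), or None if the line does not match.

    Hypothetical helper for illustration only.
    """
    m = re.search(r"tensor a \((\d+)\).*tensor b \((\d+)\)", error_line)
    if m is None:
        return None
    a, b = int(m.group(1)), int(m.group(2))
    return a, b, a / b


# Error line copied from the traceback above.
msg = (
    "RuntimeError: The size of tensor a (1792) must match the size of "
    "tensor b (448) at non-singleton dimension 0"
)

# Process count taken from the torchrun command in this report.
NPROC_PER_NODE = 4

a, b, ratio = shape_ratio(msg)
print(f"a={a}, b={b}, ratio={ratio}")
print("ratio equals process count:", ratio == NPROC_PER_NODE)
```

Here 1792 / 448 is exactly 4, the same as the number of processes launched.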

Expected behavior

The model merges successfully.

HarrytheOrange, Apr 10 '24 14:04