
Error when running on V100 (8 × 32G)

yihuaxiang opened this issue 1 year ago • 14 comments

After downloading and extracting the model, I modified the configuration and ran `bash scripts/generate.sh --input-source interactive`, which failed with an error.

The configuration changes are as follows:

(py39) [root@iZbp1219pbxs72sxk8onovZ GLM-130B]# git diff
diff --git a/configs/model_glm_130b_v100.sh b/configs/model_glm_130b_v100.sh
index 0b33485..1a474a8 100644
--- a/configs/model_glm_130b_v100.sh
+++ b/configs/model_glm_130b_v100.sh
@@ -1,5 +1,5 @@
 MODEL_TYPE="glm-130b"
-CHECKPOINT_PATH=""
+CHECKPOINT_PATH="/root/130b/glm-130b-sat"
 MP_SIZE=8
 MODEL_ARGS="--model-parallel-size ${MP_SIZE}
     --num-layers 70
diff --git a/scripts/generate.sh b/scripts/generate.sh
index 19bef0a..4732652 100644
--- a/scripts/generate.sh
+++ b/scripts/generate.sh
@@ -4,7 +4,7 @@
 script_path=$(realpath $0)
 script_dir=$(dirname $script_path)
 main_dir=$(dirname $script_dir)

-source "${main_dir}/configs/model_glm_130b.sh"
+source "${main_dir}/configs/model_glm_130b_v100.sh"

 SEED=1234
 MAX_OUTPUT_LENGTH=256

Only two files were modified:

  1. configs/model_glm_130b_v100.sh: changed CHECKPOINT_PATH
  2. scripts/generate.sh: changed model_glm_130b.sh to model_glm_130b_v100.sh; nothing else was modified
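For reference, the two one-line edits above can be scripted with sed. This is only a sketch: it operates on scratch copies so it is safe to run anywhere, and the checkpoint path is the one specific to this machine. To apply the edits for real, point the paths at configs/model_glm_130b_v100.sh and scripts/generate.sh in the repo root.

```shell
# Scratch copies standing in for the two repo files.
workdir=$(mktemp -d)
printf 'CHECKPOINT_PATH=""\n' > "$workdir/model_glm_130b_v100.sh"
printf 'source "${main_dir}/configs/model_glm_130b.sh"\n' > "$workdir/generate.sh"

# 1. Point CHECKPOINT_PATH at the extracted checkpoint directory.
sed -i 's|^CHECKPOINT_PATH=""|CHECKPOINT_PATH="/root/130b/glm-130b-sat"|' "$workdir/model_glm_130b_v100.sh"
# 2. Source the V100 config instead of the default one.
sed -i 's|model_glm_130b\.sh|model_glm_130b_v100.sh|' "$workdir/generate.sh"

cat "$workdir/model_glm_130b_v100.sh" "$workdir/generate.sh"
```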

The error output is as follows.

Core error:

Traceback (most recent call last):
  File "/root/miniconda3/envs/py39/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Full error output (per-rank repeats collapsed):

(py39) [root@iZbp1219pbxs72sxk8onovZ GLM-130B]# bash scripts/generate.sh --input-source interactive
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Setting ds_accelerator to cuda (auto detect)
    [repeated ×8, once per rank]
WARNING: No training data specified
    [repeated ×8, once per rank]
using world size: 8 and model-parallel size: 8

padded vocab (size: 150528) with 0 dummy tokens (new size: 150528)
initializing model parallel with size 8
Set tokenizer as a icetk-glm-130B tokenizer! Now you can get_tokenizer() everywhere.
global rank 0 is loading checkpoint /root/130b/glm-130b-sat/49300/mp_rank_00_model_states.pt
    [ranks 1-7 load mp_rank_01 ... mp_rank_07 likewise]
successfully loaded /root/130b/glm-130b-sat/49300/mp_rank_00_model_states.pt
    [... and the other seven rank checkpoints]
BMInf activated, memory limit: 25 GB
/root/miniconda3/envs/py39/lib/python3.9/site-packages/bminf/scheduler/__init__.py:221: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  total_size += param.numel() * param.storage().element_size()
    [repeated once per rank]
Model initialized in 121.0s
/root/GLM-130B/generation/strategies.py:17: FutureWarning: In the future np.bool will be defined as the corresponding NumPy scalar.
  self._is_done = np.zeros(self.batch_size, dtype=np.bool)
    [repeated once per rank]
Traceback (most recent call last):
  File "/root/GLM-130B/generate.py", line 215, in <module>
    main(args)
  File "/root/GLM-130B/generate.py", line 165, in main
    strategy = BaseStrategy(
  File "/root/GLM-130B/generation/strategies.py", line 17, in __init__
    self._is_done = np.zeros(self.batch_size, dtype=np.bool)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(former_attrs[attr])
AttributeError: module 'numpy' has no attribute 'bool'. `np.bool` was a deprecated alias for the builtin `bool`. To avoid this error in existing code, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here. The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    [the same traceback is raised by all eight ranks]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52609 closing signal SIGTERM
    [... same for processes 52610-52615]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 52608) of binary: /root/miniconda3/envs/py39/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/py39/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/root/GLM-130B/generate.py FAILED

Failures: <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2023-06-06_23:21:56
  host      : iZbp1219pbxs72sxk8onovZ
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 52608)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

yihuaxiang avatar Jun 06 '23 15:06 yihuaxiang

@Sengxian do you know what is going on here? Any pointers appreciated 🙏

yihuaxiang avatar Jun 06 '23 15:06 yihuaxiang

Found the problem: it is a NumPy version issue. Installing 1.20.3 fixes it.
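For anyone who would rather fix the code than downgrade NumPy: `np.bool` was deprecated in NumPy 1.20 and removed in 1.24, and the builtin `bool` is the documented drop-in replacement for the `dtype=np.bool` call at generation/strategies.py line 17. A minimal sketch of the one-line change (the batch size here is an illustrative value, not the repo's):

```python
import numpy as np

batch_size = 4  # illustrative value

# Before (raises AttributeError on NumPy >= 1.24):
#   is_done = np.zeros(batch_size, dtype=np.bool)
# After: builtin bool (or np.bool_) behaves identically.
is_done = np.zeros(batch_size, dtype=bool)
```

Pinning the environment with `pip install numpy==1.20.3`, as above, also works because that version still ships the `np.bool` alias.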

yihuaxiang avatar Jun 06 '23 16:06 yihuaxiang

@yihuaxiang Hi, could you share the weights?

zhyj3038 avatar Jun 07 '23 06:06 zhyj3038

@zhyj3038

Everything is at the default values; I did not modify anything. The error was caused by having a newer NumPy installed; after installing 1.20.3 the error went away.

yihuaxiang avatar Jun 07 '23 06:06 yihuaxiang

@yihuaxiang I hit that problem too and fixed it the same way. But I do not have the weight files; I randomly initialized the model myself and then hit a different error, e.g. IndexError: Out of range: piece id is out of range. So I want to download the weights and see whether that resolves it.
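On the "piece id is out of range" error mentioned here: with randomly initialized weights the model can sample token ids outside the tokenizer's vocabulary, and a sentencepiece-style decoder rejects such ids with exactly that IndexError. A hypothetical guard that illustrates the failure mode (`safe_decode` and the hard-coded vocab size are illustrative, not part of the repo; the 150528 figure is the padded vocab size reported in the log above):

```python
VOCAB_SIZE = 150528  # padded vocab size from the startup log above

def safe_decode(token_ids, decode_fn, vocab_size=VOCAB_SIZE):
    """Drop out-of-range ids before handing them to the tokenizer,
    instead of letting the decoder raise 'piece id is out of range'."""
    valid = [t for t in token_ids if 0 <= t < vocab_size]
    return decode_fn(valid)

# With trained weights, sampled ids stay in range; random weights can
# emit arbitrary ids, which is the likely source of the IndexError.
```

Loading the released checkpoint, rather than filtering ids, is of course the real fix.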

zhyj3038 avatar Jun 07 '23 06:06 zhyj3038

> @yihuaxiang I hit that problem too and fixed it the same way. But I do not have the weight files; I randomly initialized the model myself and then hit a different error, e.g. IndexError: Out of range: piece id is out of range. So I want to download the weights and see whether that resolves it.

Oh, I see. I did not change any weights either; I just cloned the code directly.

yihuaxiang avatar Jun 07 '23 06:06 yihuaxiang

I do not have the weights. I emailed the authors but have not heard back for a long time.

zhyj3038 avatar Jun 07 '23 06:06 zhyj3038

I do not quite understand why I am not missing the weight files, though 🤔


yihuaxiang avatar Jun 07 '23 07:06 yihuaxiang

You can run it without emailing the authors to download the weights? That seems unlikely.

zhyj3038 avatar Jun 07 '23 07:06 zhyj3038

Yeah, I just cloned the code directly; no weights needed.


yihuaxiang avatar Jun 07 '23 07:06 yihuaxiang

It is running successfully now.


yihuaxiang avatar Jun 07 '23 07:06 yihuaxiang

Let me add you on WeChat to ask for some advice. My ID is zhyj3038; please add me, thanks!

zhyj3038 avatar Jun 07 '23 07:06 zhyj3038

👌


yihuaxiang avatar Jun 07 '23 07:06 yihuaxiang

Has anyone tried converting the model to INT4 for inference?

wenshuop avatar Jun 14 '23 01:06 wenshuop