GLM-130B
Errors when running bash scripts/generate.sh --input-source interactive. Could someone please help?
(glm130b) zdbp@zdbp-ThinkStation-P920:~/GLM-130B-main$ bash scripts/generate.sh --input-source interactive
[2023-12-20 12:31:35,264] torch.distributed.run: [WARNING]
[2023-12-20 12:31:35,264] torch.distributed.run: [WARNING] *****************************************
[2023-12-20 12:31:35,264] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-12-20 12:31:35,264] torch.distributed.run: [WARNING] *****************************************
[2023-12-20 12:31:38,081] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,121] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,198] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,205] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,225] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,250] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,261] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:31:38,294] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
WARNING: No training data specified
WARNING: No training data specified
WARNING: No training data specified
using world size: 8 and model-parallel size: 8
padded vocab (size: 150528) with 0 dummy tokens (new size: 150528)
WARNING: No training data specified
initializing model parallel with size 8
WARNING: No training data specified
Set tokenizer as a icetk-glm-130B tokenizer! Now you can get_tokenizer() everywhere.
Traceback (most recent call last):
  File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in <module>
    args = initialize(extra_args_provider=add_generation_specific_args)
  File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
    args = get_args(args_list)
  File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
    initialize_distributed(args)
  File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
    torch.cuda.set_device(args.device)
  File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in TORCH_USE_CUDA_DSA
to enable device-side assertions.
WARNING: No training data specified
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in TORCH_USE_CUDA_DSA
to enable device-side assertions.
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in TORCH_USE_CUDA_DSA
to enable device-side assertions.
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in TORCH_USE_CUDA_DSA
to enable device-side assertions.
[2023-12-20 12:31:45,421] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 199262 closing signal SIGTERM
[2023-12-20 12:31:45,421] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 199263 closing signal SIGTERM
[2023-12-20 12:31:45,421] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 199265 closing signal SIGTERM
[2023-12-20 12:31:45,600] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 3 (pid: 199266) of binary: /home/zdbp/anaconda3/envs/glm130b/bin/python
Traceback (most recent call last):
File "/home/zdbp/anaconda3/envs/glm130b/bin/torchrun", line 8, in
sys.exit(main())
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/zdbp/PengJian/GLM-130B-main/generate.py FAILED
Failures:
  [1]: time : 2023-12-20_12:31:45  host : zdbp-ThinkStation-P920  rank : 4 (local_rank: 4)  exitcode : 1 (pid: 199267)  error_file: <N/A>  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
  [2]: time : 2023-12-20_12:31:45  host : zdbp-ThinkStation-P920  rank : 5 (local_rank: 5)  exitcode : 1 (pid: 199268)  error_file: <N/A>  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
  [3]: time : 2023-12-20_12:31:45  host : zdbp-ThinkStation-P920  rank : 6 (local_rank: 6)  exitcode : 1 (pid: 199269)  error_file: <N/A>  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
  [4]: time : 2023-12-20_12:31:45  host : zdbp-ThinkStation-P920  rank : 7 (local_rank: 7)  exitcode : 1 (pid: 199270)  error_file: <N/A>  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
  [0]: time : 2023-12-20_12:31:45  host : zdbp-ThinkStation-P920  rank : 3 (local_rank: 3)  exitcode : 1 (pid: 199266)  error_file: <N/A>  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
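For context on the failure above: "RuntimeError: CUDA error: invalid device ordinal" in torch.cuda.set_device means a process asked for a GPU index that does not exist, and the log shows the launcher starting 8 ranks (world size 8, model-parallel size 8, local ranks 0-7) with ranks 3-7 all dying. That pattern is what you get when the machine exposes fewer than 8 GPUs. A minimal check, offered as a sketch only (the MP_SIZE name and the conversion tool are taken from the GLM-130B repo layout and are assumptions here, not something shown in this log):

# Sketch: compare how many GPUs are actually visible with the 8 ranks being launched.
nvidia-smi -L                                                # list physical GPUs
python -c "import torch; print(torch.cuda.device_count())"   # GPUs visible to PyTorch
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"            # empty after '=' means all GPUs are visible
# If fewer than 8 devices show up, an 8-way model-parallel launch cannot work as-is:
# the parallel size (MP_SIZE in configs/model_glm_130b.sh, an assumption) has to match
# the available GPUs, and the 8-way checkpoint would then also need to be converted to
# the smaller parallel size with the repo's checkpoint conversion tool (tools/convert_tp,
# if I remember the layout correctly).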
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$ pip install torchrun
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
ERROR: Could not find a version that satisfies the requirement torchrun (from versions: none)
ERROR: No matching distribution found for torchrun
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$ bash scripts/generate.sh --input-source interactive
python: can't open file '/home/zdbp/PengJian/GLM-130B-main/8': [Errno 2] No such file or directory
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$ pip install bminf
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting bminf
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/1b/9b/56bbb3f30672e11e64ab0da315459f65d5ae8608e379a41ea6ef442dffb6/bminf-2.0.1-py3-none-any.whl (52 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.3/52.3 kB 690.4 kB/s eta 0:00:00
Requirement already satisfied: torch in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from bminf) (2.1.1+cu121)
Requirement already satisfied: cpm-kernels>=1.0.9 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from bminf) (1.0.11)
Requirement already satisfied: typing-extensions in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from bminf) (4.9.0)
Requirement already satisfied: filelock in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (3.9.0)
Requirement already satisfied: sympy in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (1.12)
Requirement already satisfied: networkx in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (3.0)
Requirement already satisfied: jinja2 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (3.1.2)
Requirement already satisfied: fsspec in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (2023.10.0)
Requirement already satisfied: triton==2.1.0 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from torch->bminf) (2.1.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from jinja2->torch->bminf) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in /home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages (from sympy->torch->bminf) (1.3.0)
Installing collected packages: bminf
Successfully installed bminf-2.0.1
(glm130b) zdbp@zdbp-ThinkStation-P920:~/PengJian/GLM-130B-main$ bash scripts/generate.sh --input-source interactive
[2023-12-20 12:36:32,086] torch.distributed.run: [WARNING]
[2023-12-20 12:36:32,086] torch.distributed.run: [WARNING] *****************************************
[2023-12-20 12:36:32,086] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-12-20 12:36:32,086] torch.distributed.run: [WARNING] *****************************************
[2023-12-20 12:36:34,707] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:34,756] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:34,961] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,021] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,036] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,073] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,147] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-20 12:36:35,153] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use fused_layer_norm, fall back to torch.nn.LayerNorm
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
Please install apex to use FusedScaleMaskSoftmax, otherwise the inference efficiency will be greatly reduced
WARNING: No training data specified
using world size: 8 and model-parallel size: 8
padded vocab (size: 150528) with 0 dummy tokens (new size: 150528)
initializing model parallel with size 8
Set tokenizer as a icetk-glm-130B tokenizer! Now you can get_tokenizer() everywhere.
WARNING: No training data specified
WARNING: No training data specified
Traceback (most recent call last):
  File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in <module>
    args = initialize(extra_args_provider=add_generation_specific_args)
  File "/home/zdbp/PengJian/GLM-130B-main/initialize.py", line 48, in initialize
    args = get_args(args_list)
  File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 385, in get_args
    initialize_distributed(args)
  File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/SwissArmyTransformer/arguments.py", line 414, in initialize_distributed
    torch.cuda.set_device(args.device)
  File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
WARNING: No training data specified
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in TORCH_USE_CUDA_DSA
to enable device-side assertions.
WARNING: No training data specified
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in TORCH_USE_CUDA_DSA
to enable device-side assertions.
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in TORCH_USE_CUDA_DSA
to enable device-side assertions.
WARNING: No training data specified
Traceback (most recent call last):
File "/home/zdbp/PengJian/GLM-130B-main/generate.py", line 212, in TORCH_USE_CUDA_DSA
to enable device-side assertions.
[2023-12-20 12:36:42,265] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 209418 closing signal SIGTERM
[2023-12-20 12:36:42,265] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 209419 closing signal SIGTERM
[2023-12-20 12:36:42,266] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 209420 closing signal SIGTERM
[2023-12-20 12:36:42,431] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 3 (pid: 209422) of binary: /home/zdbp/anaconda3/envs/glm130b/bin/python
Traceback (most recent call last):
File "/home/zdbp/anaconda3/envs/glm130b/bin/torchrun", line 8, in
sys.exit(main())
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/zdbp/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/zdbp/PengJian/GLM-130B-main/generate.py FAILED
Failures:
  [1]: time : 2023-12-20_12:36:42  host : zdbp-ThinkStation-P920  rank : 4 (local_rank: 4)  exitcode : 1 (pid: 209423)  error_file: <N/A>  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
  [2]: time : 2023-12-20_12:36:42  host : zdbp-ThinkStation-P920  rank : 5 (local_rank: 5)  exitcode : 1 (pid: 209424)  error_file: <N/A>  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
  [3]: time : 2023-12-20_12:36:42  host : zdbp-ThinkStation-P920  rank : 6 (local_rank: 6)  exitcode : 1 (pid: 209425)  error_file: <N/A>  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
  [4]: time : 2023-12-20_12:36:42  host : zdbp-ThinkStation-P920  rank : 7 (local_rank: 7)  exitcode : 1 (pid: 209426)  error_file: <N/A>  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
  [0]: time : 2023-12-20_12:36:42  host : zdbp-ThinkStation-P920  rank : 3 (local_rank: 3)  exitcode : 1 (pid: 209422)  error_file: <N/A>  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
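Two side notes on the attempts in between, hedged since they are inferences from this log rather than confirmed facts: (1) pip install torchrun fails because torchrun is not a standalone PyPI package; it is the console script that ships with PyTorch itself, and python -m torch.distributed.run is the equivalent invocation. (2) The earlier "python: can't open file '/home/zdbp/PengJian/GLM-130B-main/8'" error looks like the launch line in scripts/generate.sh was temporarily changed from the distributed launcher to bare python, so the GPU count "8" got treated as a script path; by this third run the torchrun launch clearly works again, and installing bminf does not change how many ranks are spawned, which is why the same "invalid device ordinal" failure recurs. A sketch of the usual launch shape (NUM_GPUS is a hypothetical name and must match the GPUs actually present):

# Sketch only: the distributed launcher that generate.sh normally wraps.
NUM_GPUS=8   # hypothetical; set to the number of GPUs really available
python -m torch.distributed.run --nproc_per_node "$NUM_GPUS" generate.py --input-source interactive
# (the remaining model and generation arguments that generate.sh assembles from its config scripts are omitted here)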
Has this been solved? I'm running into exactly the same error. Please help!