transformers-bloom-inference
NotImplementedError: Cannot copy out of meta tensor; no data!
Hi,
I am using the DeepSpeed framework to speed up inference for BLOOM-7B1, with the command shown below:
deepspeed --num_gpus 4 bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom-7b1
But I got the following error instead:
(bloom) xxx@HOST-xxx:~/projects/transformers-bloom-inference/bloom-inference-scripts$ bash run_deepspeed.sh
[2023-02-10 17:46:16,148] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-10 17:46:16,202] [INFO] [runner.py:548:main] cmd = /home/caojunzhi/anaconda3/envs/chatgpt/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None bloom-ds-inference.py --name bigscience/bloom-7b1
[2023-02-10 17:46:19,604] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-02-10 17:46:19,604] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-02-10 17:46:19,604] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-02-10 17:46:19,604] [INFO] [launch.py:162:main] dist_world_size=1
[2023-02-10 17:46:19,604] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-02-10 17:46:23,455] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model bigscience/bloom-7b1
Fetching 13 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 33951.40it/s]
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 8339.85it/s]
Fetching 13 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 7358.43it/s]
Fetching 13 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 26572.10it/s]
[2023-02-10 17:46:33,775] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
[2023-02-10 17:46:33,778] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-02-10 17:46:33,779] [INFO] [logging.py:68:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Installed CUDA version 11.1 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /data/xxx/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /data/xxx/.cache/torch_extensions/py310_cu117/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.1198277473449707 seconds
[2023-02-10 17:46:34,344] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 4096, 'intermediate_size': 16384, 'heads': 32, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': True, 'max_out_tokens': 1024, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False}
Installed CUDA version 11.1 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /data/xxx/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.0038442611694335938 seconds
Loading 2 checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:21<00:00, 9.94s/it]checkpoint loading time at rank 0: 21.33984684944153 sec
Loading 2 checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:21<00:00, 10.67s/it]
Traceback (most recent call last):
File "/data/xxx/projects/transformers-bloom-inference/bloom-inference-scripts/bloom-ds-inference.py", line 181, in <module>
model = deepspeed.init_inference(
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/__init__.py", line 311, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 129, in __init__
self.module.to(device)
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/transformers/modeling_utils.py", line 1749, in to
return super().to(*args, **kwargs)
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 989, in to
return self._apply(convert)
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 664, in _apply
param_applied = fn(param)
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
[2023-02-10 17:46:57,652] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 25235
[2023-02-10 17:46:57,653] [ERROR] [launch.py:324:sigkill_handler] ['/home/caojunzhi/anaconda3/envs/chatgpt/bin/python', '-u', 'bloom-ds-inference.py', '--local_rank=0', '--name', 'bigscience/bloom-7b1'] exits with return code = 1
My main conda environment is:
accelerate 0.16.0
deepspeed 0.8.0
deepspeed-mii 0.0.2
huggingface-hub 0.12.0
tokenizers 0.12.1
torch 1.13.1
transformers 4.26.0
My nvidia-smi info is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:00:06.0 Off | 0 |
| N/A 35C P0 37W / 250W | 1253MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:00:07.0 Off | 0 |
| N/A 37C P0 40W / 250W | 2411MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-PCIE... Off | 00000000:00:08.0 Off | 0 |
| N/A 32C P0 24W / 250W | 4MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-PCIE... Off | 00000000:00:09.0 Off | 0 |
| N/A 33C P0 24W / 250W | 4MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Can you help me to solve this bug? Thank you very much!
This is a bug in DeepSpeed. Can you report it there? Also, FYI, DS-inference doesn't work with PyTorch 1.13.1 yet. I would suggest falling back to 1.12.1.
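Something along these lines should do it (just a suggestion, untested on your setup; pick the +cuXXX wheel tag that matches your local CUDA toolkit):
pip uninstall -y torch
pip install torch==1.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116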
Thanks for your reply. When I downgraded torch to 1.12.1 and switched CUDA to the matching version (10.2.89), the previous error indeed disappeared, but a new one appeared, as shown below.
[2023-02-12 10:19:51,085] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-12 10:19:51,252] [INFO] [runner.py:548:main] cmd = /usr/local/tools/Python-3.10.9/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None bloom-ds-inference.py --name /home/zandaoguang/downloads/bloom-7b1
[2023-02-12 10:19:53,867] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-02-12 10:19:53,868] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-02-12 10:19:53,868] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-02-12 10:19:53,868] [INFO] [launch.py:162:main] dist_world_size=1
[2023-02-12 10:19:53,868] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-02-12 10:19:56,839] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model /home/zandaoguang/downloads/bloom-7b1
[2023-02-12 10:20:01,592] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
[2023-02-12 10:20:01,594] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-02-12 10:20:01,594] [INFO] [logging.py:68:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Using /root/.cache/torch_extensions/py310_cu102 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu102/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/4] //usr/local/cuda-10.2/bin/nvcc -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/THC -isystem //usr/local/cuda-10.2/include -isystem /usr/local/tools/Python-3.10.9/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/gelu.cu -o gelu.cuda.o
FAILED: gelu.cuda.o
//usr/local/cuda-10.2/bin/nvcc -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/THC -isystem //usr/local/cuda-10.2/include -isystem /usr/local/tools/Python-3.10.9/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/gelu.cu -o gelu.cuda.o
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/conversion_utils.h(268): error: identifier "__double2half" is undefined
1 error detected in the compilation of "/tmp/tmpxft_00006b7b_00000000-6_gelu.cpp1.ii".
[2/4] //usr/local/cuda-10.2/bin/nvcc -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/THC -isystem //usr/local/cuda-10.2/include -isystem /usr/local/tools/Python-3.10.9/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/relu.cu -o relu.cuda.o
FAILED: relu.cuda.o
//usr/local/cuda-10.2/bin/nvcc -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/THC -isystem //usr/local/cuda-10.2/include -isystem /usr/local/tools/Python-3.10.9/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/relu.cu -o relu.cuda.o
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/conversion_utils.h(268): error: identifier "__double2half" is undefined
1 error detected in the compilation of "/tmp/tmpxft_00006b7c_00000000-6_relu.cpp1.ii".
[3/4] //usr/local/cuda-10.2/bin/nvcc -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/THC -isystem //usr/local/cuda-10.2/include -isystem /usr/local/tools/Python-3.10.9/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu -o layer_norm.cuda.o
FAILED: layer_norm.cuda.o
//usr/local/cuda-10.2/bin/nvcc -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/TH -isystem /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/include/THC -isystem //usr/local/cuda-10.2/include -isystem /usr/local/tools/Python-3.10.9/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu -o layer_norm.cuda.o
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/conversion_utils.h(268): error: identifier "__double2half" is undefined
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(59): warning: variable "residual_buffer" was declared but never referenced
detected during:
instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]"
(165): here
instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]"
(191): here
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(60): warning: variable "bias_buffer" was declared but never referenced
detected during:
instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]"
(165): here
instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]"
(191): here
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/reduction_utils.h(520): error: class "cooperative_groups::__v1::thread_block_tile<32U>" has no member "meta_group_rank"
detected during:
instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]"
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(165): here
instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]"
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(191): here
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/reduction_utils.h(409): error: class "cooperative_groups::__v1::thread_block_tile<32U>" has no member "meta_group_size"
detected during:
instantiation of "void reduce::partitioned_block<Op,num_threads>(cooperative_groups::__v1::thread_block &, cooperative_groups::__v1::thread_block_tile<32U> &, float &) [with Op=reduce::ROpType::Add, num_threads=1]"
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(72): here
instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]"
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(165): here
instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]"
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(191): here
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/reduction_utils.h(414): error: class "cooperative_groups::__v1::thread_block_tile<32U>" has no member "meta_group_rank"
detected during:
instantiation of "void reduce::partitioned_block<Op,num_threads>(cooperative_groups::__v1::thread_block &, cooperative_groups::__v1::thread_block_tile<32U> &, float &) [with Op=reduce::ROpType::Add, num_threads=1]"
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(72): here
instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]"
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(165): here
instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]"
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(191): here
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/reduction_utils.h(421): error: class "cooperative_groups::__v1::thread_block_tile<32U>" has no member "meta_group_rank"
detected during:
instantiation of "void reduce::partitioned_block<Op,num_threads>(cooperative_groups::__v1::thread_block &, cooperative_groups::__v1::thread_block_tile<32U> &, float &) [with Op=reduce::ROpType::Add, num_threads=1]"
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(72): here
instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]"
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(165): here
instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]"
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(191): here
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/reduction_utils.h(422): error: class "cooperative_groups::__v1::thread_block_tile<32U>" has no member "meta_group_size"
detected during:
instantiation of "void reduce::partitioned_block<Op,num_threads>(cooperative_groups::__v1::thread_block &, cooperative_groups::__v1::thread_block_tile<32U> &, float &) [with Op=reduce::ROpType::Add, num_threads=1]"
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(72): here
instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]"
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(165): here
instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]"
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(191): here
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/includes/reduction_utils.h(447): error: class "cooperative_groups::__v1::thread_block_tile<32U>" has no member "meta_group_rank"
detected during:
instantiation of "void reduce::partitioned_block<Op,num_threads>(cooperative_groups::__v1::thread_block &, cooperative_groups::__v1::thread_block_tile<32U> &, float &) [with Op=reduce::ROpType::Add, num_threads=1]"
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(72): here
instantiation of "void fused_ln<T,unRoll,threadsPerGroup,maxThreads>(T *, const T *, const T *, const T *, float, int) [with T=__half, unRoll=1, threadsPerGroup=1, maxThreads=256]"
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(165): here
instantiation of "void launch_fused_ln(T *, const T *, const T *, const T *, float, int, int, cudaStream_t) [with T=__half]"
/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu(191): here
[... the same "residual_buffer" / "bias_buffer" unused-variable warnings are repeated for the remaining fused_ln template instantiations (T=__half and T=float, various unRoll / threadsPerGroup values) ...]
7 errors detected in the compilation of "/tmp/tmpxft_00006b7d_00000000-6_layer_norm.cpp1.ii".
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1808, in _run_ninja_build
subprocess.run(
File "/usr/local/tools/Python-3.10.9/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/data/zandaoguang/projects/transformers-bloom-inference/bloom-inference-scripts/bloom-ds-inference.py", line 183, in <module>
model = deepspeed.init_inference(
File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/__init__.py", line 311, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 126, in __init__
self._apply_injection_policy(config)
File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 339, in _apply_injection_policy
replace_transformer_layer(client_module,
File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 792, in replace_transformer_layer
replaced_module = replace_module(model=model,
File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1061, in replace_module
replaced_module, _ = _replace_module(model, policy)
File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1088, in _replace_module
_, layer_id = _replace_module(child, policies, layer_id=layer_id)
File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1088, in _replace_module
_, layer_id = _replace_module(child, policies, layer_id=layer_id)
File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1078, in _replace_module
replaced_module = policies[child.__class__][0](child,
File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 782, in replace_fn
new_module = replace_with_policy(child,
File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 473, in replace_with_policy
new_module = transformer_inference.DeepSpeedTransformerInference(
File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 53, in __init__
inference_cuda_module = builder.load()
File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 462, in load
return self.jit_load(verbose)
File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load
op_module = load(
File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1202, in load
return _jit_compile(
File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1425, in _jit_compile
_write_ninja_file_and_build_library(
File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1537, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/usr/local/tools/Python-3.10.9/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1824, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'transformer_inference'
[2023-02-12 10:20:03,879] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 27397
[2023-02-12 10:20:03,880] [ERROR] [launch.py:324:sigkill_handler] ['/usr/local/tools/Python-3.10.9/bin/python3.10', '-u', 'bloom-ds-inference.py', '--local_rank=0', '--name', '/home/zandaoguang/downloads/bloom-7b1'] exits with return code = 1
The package list of my environment (Python 3.10.9) is:
Package Version
------------------ ----------
accelerate 0.16.0
certifi 2022.12.7
charset-normalizer 3.0.1
deepspeed 0.8.0
filelock 3.9.0
hjson 3.1.0
huggingface-hub 0.12.0
idna 3.4
ninja 1.11.1
numpy 1.24.2
packaging 23.0
pip 22.3.1
psutil 5.9.4
py-cpuinfo 9.0.0
pydantic 1.10.4
PyYAML 6.0
regex 2022.10.31
requests 2.28.2
setuptools 65.5.0
tokenizers 0.12.1
torch 1.12.1
tqdm 4.64.1
transformers 4.26.0
typing_extensions 4.4.0
urllib3 1.26.14
The nvcc -V result is:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
Can you help me to solve it? Thanks.
I am not really sure. I haven't seen this before, but it seems like CUDA is not able to compile some kernels in DeepSpeed. I am using CUDA 11.6 with 8x A100 80GB GPUs. Can you try switching to CUDA 11.6? If not, there is a Dockerfile that is tested and works fine. However, you will need to modify it a bit for the standalone script; I am using it for the inference server.
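It may also be worth double-checking which CUDA toolkit the DeepSpeed JIT build picks up versus the CUDA version your torch wheel was built with; these are standard checks, nothing specific to this repo:
python -c "import torch; print(torch.version.cuda)"
nvcc -V
echo $CUDA_HOME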
Actually, I can only use CUDA 10.2; when I use other CUDA versions, I get the following error:
[2023-02-12 16:48:25,193] [WARNING] [runner.py:179:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-12 16:48:25,352] [INFO] [runner.py:508:main] cmd = /home/caojunzhi/anaconda3/envs/chatgpt/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 bloom-ds-inference.py --name /home/zandaoguang/downloads/bloom-7b1
[2023-02-12 16:48:27,793] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-02-12 16:48:27,793] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-02-12 16:48:27,793] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-02-12 16:48:27,793] [INFO] [launch.py:162:main] dist_world_size=1
[2023-02-12 16:48:27,793] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-02-12 16:48:30,664] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model /home/zandaoguang/downloads/bloom-7b1
[2023-02-12 16:48:35,960] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.6, git-hash=unknown, git-branch=unknown
[2023-02-12 16:48:35,963] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-02-12 16:48:35,963] [INFO] [logging.py:68:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Traceback (most recent call last):
File "/data/zandaoguang/projects/transformers-bloom-inference/bloom-inference-scripts/bloom-ds-inference.py", line 183, in <module>
model = deepspeed.init_inference(
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/__init__.py", line 311, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 124, in __init__
self._apply_injection_policy(config)
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 349, in _apply_injection_policy
replace_transformer_layer(client_module,
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 881, in replace_transformer_layer
replaced_module = replace_module(model=model,
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1139, in replace_module
replaced_module, _ = _replace_module(model, policy)
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1166, in _replace_module
_, layer_id = _replace_module(child, policies, layer_id=layer_id)
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1166, in _replace_module
_, layer_id = _replace_module(child, policies, layer_id=layer_id)
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 1156, in _replace_module
replaced_module = policies[child.__class__][0](child,
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 871, in replace_fn
new_module = replace_with_policy(child,
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 454, in replace_with_policy
new_module = transformer_inference.DeepSpeedTransformerInference(
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 53, in __init__
inference_cuda_module = builder.load()
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 459, in load
return self.jit_load(verbose)
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 474, in jit_load
assert_no_cuda_mismatch()
File "/home/caojunzhi/anaconda3/envs/chatgpt/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 100, in assert_no_cuda_mismatch
raise Exception(
Exception: Installed CUDA version 11.1 does not match the version torch was compiled with 10.2, unable to compile cuda/cpp extensions without a matching cuda version.
[2023-02-12 16:48:37,805] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 15223
[2023-02-12 16:48:37,805] [ERROR] [launch.py:324:sigkill_handler] ['/home/caojunzhi/anaconda3/envs/chatgpt/bin/python', '-u', 'bloom-ds-inference.py', '--local_rank=0', '--name', '/home/zandaoguang/downloads/bloom-7b1'] exits with return code = 1
The pip list result is:
Package Version
------------------------ ----------
accelerate 0.15.0
aiohttp 3.8.3
aiosignal 1.3.1
anyio 3.6.2
asttokens 2.2.1
async-timeout 4.0.2
asyncio 3.4.3
attrs 22.2.0
backcall 0.2.0
certifi 2022.12.7
charset-normalizer 2.1.1
click 8.1.3
comm 0.1.2
datasets 2.9.0
debugpy 1.6.6
decorator 5.1.1
deepspeed 0.7.6
deepspeed-mii 0.0.4
dill 0.3.6
executing 1.2.0
fastapi 0.89.1
filelock 3.9.0
Flask 2.2.2
Flask-API 3.0.post1
Flask-Cors 3.0.10
frozenlist 1.3.3
fsspec 2023.1.0
grpcio 1.51.1
grpcio-tools 1.50.0
gunicorn 20.1.0
h11 0.14.0
hjson 3.1.0
huggingface-hub 0.10.1
idna 3.4
ipdb 0.13.11
ipykernel 6.21.0
ipython 8.9.0
itsdangerous 2.1.2
jedi 0.18.2
Jinja2 3.1.2
joblib 1.2.0
jupyter_client 8.0.2
jupyter_core 5.2.0
MarkupSafe 2.1.2
matplotlib-inline 0.1.6
multidict 6.0.4
multiprocess 0.70.14
ninja 1.11.1
numpy 1.24.1
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
packaging 23.0
pandas 1.5.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.4.0
pip 23.0
platformdirs 2.6.2
prompt-toolkit 3.0.36
protobuf 4.21.12
psutil 5.9.4
ptyprocess 0.7.0
pure-eval 0.2.2
py-cpuinfo 9.0.0
pyarrow 11.0.0
pydantic 1.10.2
Pygments 2.14.0
python-dateutil 2.8.2
pytz 2022.7.1
PyYAML 6.0
pyzmq 25.0.0
regex 2022.10.31
requests 2.28.2
responses 0.18.0
sacremoses 0.0.53
sentencepiece 0.1.97
setuptools 65.6.3
six 1.16.0
sniffio 1.3.0
stack-data 0.6.2
starlette 0.22.0
tokenizers 0.12.1
tomli 2.0.1
torch 1.12.1
torchvision 0.13.1
tornado 6.2
tqdm 4.64.1
traitlets 5.9.0
transformers 4.25.1
typing_extensions 4.4.0
urllib3 1.26.14
uvicorn 0.19.0
wcwidth 0.2.6
Werkzeug 2.2.2
wheel 0.37.1
xxhash 3.2.0
yarl 1.8.2
The output of nvcc -V is:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
The output of nvidia-smi is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:00:06.0 Off | 0 |
| N/A 32C P0 26W / 250W | 0MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-PCIE... Off | 00000000:00:07.0 Off | 0 |
| N/A 33C P0 28W / 250W | 0MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-PCIE... Off | 00000000:00:08.0 Off | 0 |
| N/A 32C P0 24W / 250W | 0MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-PCIE... Off | 00000000:00:09.0 Off | 0 |
| N/A 33C P0 24W / 250W | 0MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
It feels like a version issue, but I have tried to keep my versions the same as in your Dockerfile. Have you encountered this problem before? Thank you again.
I think your environment has CUDA 11.1 installed, while your torch build was compiled with CUDA 10.2. Can you install a torch build that matches your installed CUDA version?
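As a quick sanity check (a minimal sketch, not specific to this repo), you can compare the CUDA version torch was built against with the toolkit that nvcc reports, and reinstall torch from the matching wheel index if they differ:
# cuda_version_check.py -- minimal sketch: compare the CUDA version the installed
# torch wheel was built with against the CUDA toolkit found on the system.
import subprocess

import torch

print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)  # e.g. "10.2", "11.6"; None for CPU-only builds

# nvcc reports the locally installed toolkit (e.g. "release 11.1")
out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
print([line for line in out.splitlines() if "release" in line])

# If these disagree, install a matching build, e.g. for CUDA 11.6:
#   pip install torch==1.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116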
Hi @mayank31398, I ran into a similar issue when using the DeepSpeed framework to speed up inference for BLOOM (176B). Could you please take a look? Many thanks.
The cmd is shown below:
deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
The log is listed as follows:
[root@7656ea32130c transformers-bloom-inference]# deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
[2023-03-09 06:03:27,119] [WARNING] [runner.py:179:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-03-09 06:03:30,403] [INFO] [runner.py:508:main] cmd = /opt/conda/envs/inference/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-devel-2.12.10-1+cuda11.6
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.12.10
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.12.10-1
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE=libnccl-2.12.10-1+cuda11.6
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-devel
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_VERSION=2.12.10
[2023-03-09 06:03:32,070] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.12.10-1
[2023-03-09 06:03:32,070] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-03-09 06:03:32,070] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-03-09 06:03:32,070] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-03-09 06:03:32,070] [INFO] [launch.py:162:main] dist_world_size=8
[2023-03-09 06:03:32,070] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-03-09 06:03:34,840] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Downloading (…)lve/main/config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 573/573 [00:00<00:00, 41.8kB/s]
/cos/HF_cache/models--bigscience--bloom/snapshots/ea51bbb9a58423efb336e2d6c900a8b3dc64b2eb
[2023-03-09 06:03:44,806] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.6, git-hash=unknown, git-branch=unknown
[2023-03-09 06:03:44,807] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,807] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,808] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,808] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,809] [INFO] [logging.py:68:log_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-03-09 06:03:44,808] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,808] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,809] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-09 06:03:44,809] [WARNING] [config_utils.py:67:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu116/transformer_inference...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu116/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/9] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/dequantize.cu -o dequantize.cuda.o
[2/9] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/relu.cu -o relu.cuda.o
[3/9] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu -o transform.cuda.o
/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(56): warning #177-D: variable "lane" was declared but never referenced
/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(95): warning #177-D: variable "half_dim" was declared but never referenced
/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(112): warning #177-D: variable "vals_half" was declared but never referenced
/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(113): warning #177-D: variable "output_half" was declared but never referenced
/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/transform.cu(130): warning #177-D: variable "lane" was declared but never referenced
[4/9] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/apply_rotary_pos_emb.cu -o apply_rotary_pos_emb.cuda.o
[5/9] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/softmax.cu -o softmax.cuda.o
/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/softmax.cu(275): warning #177-D: variable "alibi_offset" was declared but never referenced
/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/softmax.cu(430): warning #177-D: variable "warp_num" was declared but never referenced
[6/9] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/gelu.cu -o gelu.cuda.o
[7/9] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/layer_norm.cu -o layer_norm.cuda.o
[8/9] c++ -MMD -MF pt_binding.o.d -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/TH -isystem /opt/conda/envs/inference/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/envs/inference/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -c /opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp -o pt_binding.o
[9/9] c++ pt_binding.o gelu.cuda.o relu.cuda.o layer_norm.cuda.o softmax.cuda.o dequantize.cuda.o apply_rotary_pos_emb.cuda.o transform.cuda.o -shared -lcurand -L/opt/conda/envs/inference/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o transformer_inference.so
Loading extension module transformer_inference...
Time to load transformer_inference op: 24.044259786605835 seconds
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Time to load transformer_inference op: 24.013221502304077 seconds
Loading extension module transformer_inference...
Time to load transformer_inference op: 23.90252995491028 seconds
Loading extension module transformer_inference...
Time to load transformer_inference op: 24.007809162139893 seconds
Time to load transformer_inference op: 24.017361402511597 seconds
Loading extension module transformer_inference...
Time to load transformer_inference op: 23.906622886657715 seconds
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Time to load transformer_inference op: 24.007588863372803 seconds
Time to load transformer_inference op: 24.015749216079712 seconds
[2023-03-09 06:04:09,565] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 14336, 'intermediate_size': 57344, 'heads': 112, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 8, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': True, 'max_out_tokens': 1024, 'scale_attn_by_inverse_layer_idx': False}
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06393146514892578 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...Time to load transformer_inference op: 0.061557769775390625 seconds
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.061757564544677734 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06235527992248535 seconds
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06160426139831543 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06882047653198242 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06495046615600586 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.07005953788757324 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.05634450912475586 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.05931544303894043 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06092071533203125 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.05466651916503906 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.058559417724609375 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.05735135078430176 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.05769968032836914 seconds
Using /root/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06432437896728516 seconds
Loading 0 checkpoint shards: 0it [00:00, ?it/s]checkpoint loading time at rank 6: 0.0035653114318847656 sec
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s]checkpoint loading time at rank 4: 0.0038051605224609375 sec
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Loading 0 checkpoint shards: 0it [00:00, ?it/s]checkpoint loading time at rank 3: 0.0014710426330566406 sec
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
Traceback (most recent call last):
File "/opt/conda/envs/inference/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/inference/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/benchmark.py", line 119, in <module>
main()
File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/benchmark.py", line 115, in main
benchmark_end_to_end(args)
File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/benchmark.py", line 48, in benchmark_end_to_end
model, initialization_time = run_and_log_time(partial(ModelDeployment, args=args, grpc_allowed=False))
File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/utils/utils.py", line 152, in run_and_log_time
Loading 0 checkpoint shards: 0it [00:00, ?it/s]checkpoint loading time at rank 7: 0.002664327621459961 sec
Loading 0 checkpoint shards: 0it [00:00, ?it/s]
results = execs()
File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/model_handler/deployment.py", line 54, in __init__
self.model = get_model_class(args.deployment_framework)(args)
File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/models/ds_inference.py", line 53, in __init__
Traceback (most recent call last):
File "/opt/conda/envs/inference/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/inference/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/benchmark.py", line 119, in <module>
main()
File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/benchmark.py", line 115, in main
benchmark_end_to_end(args)
File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/benchmark.py", line 48, in benchmark_end_to_end
model, initialization_time = run_and_log_time(partial(ModelDeployment, args=args, grpc_allowed=False))
File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/utils/utils.py", line 152, in run_and_log_time
results = execs()
File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/model_handler/deployment.py", line 54, in __init__
self.model = get_model_class(args.deployment_framework)(args)
File "/nfs/users1/usera/test/transformers-bloom-inference/inference_server/models/ds_inference.py", line 53, in __init__
self.model = deepspeed.init_inference(
File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/__init__.py", line 311, in init_inference
self.model = deepspeed.init_inference(engine = InferenceEngine(model, config=ds_inference_config)
File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 127, in __init__
File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/__init__.py", line 311, in init_inference
self.module.to(device)
File "/opt/conda/envs/inference/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1749, in to
engine = InferenceEngine(model, config=ds_inference_config)
File "/opt/conda/envs/inference/lib/python3.9/site-packages/deepspeed/inference/engine.py", line 127, in __init__
self.module.to(device)
File "/opt/conda/envs/inference/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1749, in to
return super().to(*args, **kwargs)
File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 927, in to
return super().to(*args, **kwargs)
File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 927, in to
return self._apply(convert)
File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
return self._apply(convert)
File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 602, in _apply
param_applied = fn(param)
File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 925, in convert
module._apply(fn)
File "/opt/conda/envs/inference/lib/python3.9/site-packages/torch/nn/modules/module.py", line 602, in _apply
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
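For reference, this is the generic PyTorch error raised when a module whose parameters still live on the meta device (shape-only placeholders with no storage) is moved with .to(); in the log above, the Loading 0 checkpoint shards: 0it lines suggest no real weights were loaded before self.module.to(device) ran. A minimal sketch of the failure mode, independent of DeepSpeed:
# meta_tensor_repro.py -- minimal sketch of what raises this error, independent
# of DeepSpeed: tensors on the "meta" device carry only shape/dtype metadata,
# so there is no data to copy when the module is moved to a real device.
import torch

layer = torch.nn.Linear(4, 4, device="meta")  # parameters are placeholders only
try:
    layer.to("cpu")  # same failure mode as self.module.to(device) above
except NotImplementedError as err:
    print(err)  # Cannot copy out of meta tensor; no data!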
Here, I use the Docker image built from the Dockerfile at https://github.com/huggingface/transformers-bloom-inference/blob/main/Dockerfile. The output of pip list is:
Package Version
------------------ ------------
accelerate 0.16.0
anyio 3.6.2
certifi 2022.12.7
charset-normalizer 3.1.0
click 8.1.3
deepspeed 0.7.6
fastapi 0.89.1
filelock 3.9.0
Flask 2.2.3
Flask-API 3.0.post1
grpcio 1.51.3
grpcio-tools 1.50.0
gunicorn 20.1.0
h11 0.14.0
hjson 3.1.0
huggingface-hub 0.12.1
idna 3.4
importlib-metadata 6.0.0
itsdangerous 2.1.2
Jinja2 3.1.2
MarkupSafe 2.1.2
ninja 1.11.1
numpy 1.24.2
packaging 23.0
pip 23.0.1
protobuf 4.22.1
psutil 5.9.4
py-cpuinfo 9.0.0
pydantic 1.10.2
PyYAML 6.0
regex 2022.10.31
requests 2.28.2
setuptools 65.6.3
sniffio 1.3.0
starlette 0.22.0
tokenizers 0.13.2
torch 1.12.1+cu116
tqdm 4.65.0
transformers 4.26.1
typing_extensions 4.5.0
urllib3 1.26.14
uvicorn 0.19.0
Werkzeug 2.2.3
wheel 0.38.4
zipp 3.15.0
The output of nvidia-smi is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03 Driver Version: 510.108.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:27:00.0 Off | 0 |
| N/A 32C P0 69W / 400W | 35MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:2A:00.0 Off | 0 |
| N/A 29C P0 66W / 400W | 35MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:51:00.0 Off | 0 |
| N/A 31C P0 69W / 400W | 35MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:57:00.0 Off | 0 |
| N/A 33C P0 63W / 400W | 35MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:9E:00.0 Off | 0 |
| N/A 32C P0 65W / 400W | 35MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... On | 00000000:A4:00.0 Off | 0 |
| N/A 30C P0 63W / 400W | 35MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... On | 00000000:C7:00.0 Off | 0 |
| N/A 29C P0 64W / 400W | 35MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... On | 00000000:CA:00.0 Off | 0 |
| N/A 32C P0 66W / 400W | 35MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
The output of nvcc -V is:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
The dockerfile works out of the box. Can you give it a shot?
Many thanks for your prompt response @mayank31398
The dockerfile is as follows:
root@super-klb:~/test/transformers-bloom-inference-GPU# cat Dockerfile
FROM nvidia/cuda:11.6.1-devel-ubi8 as base
RUN dnf install -y --disableplugin=subscription-manager make git && dnf clean all --disableplugin=subscription-manager
# taken form pytorch's dockerfile
RUN curl -L -o ./miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
chmod +x ./miniconda.sh && \
./miniconda.sh -b -p /opt/conda && \
rm ./miniconda.sh
ENV PYTHON_VERSION=3.9 \
PATH=/opt/conda/envs/inference/bin:/opt/conda/bin:${PATH}
# create conda env
RUN conda create -n inference python=${PYTHON_VERSION} pip -y
# change shell to activate env
SHELL ["conda", "run", "-n", "inference", "/bin/bash", "-c"]
FROM base as conda
# update conda
RUN conda update -n base -c defaults conda -y
# cmake
RUN conda install -c anaconda cmake -y
# necessary stuff
RUN pip install torch==1.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116 \
transformers==4.26.1 \
deepspeed==0.7.6 \
accelerate==0.16.0 \
gunicorn==20.1.0 \
flask \
flask_api \
fastapi==0.89.1 \
uvicorn==0.19.0 \
jinja2==3.1.2 \
pydantic==1.10.2 \
huggingface_hub==0.12.1 \
grpcio-tools==1.50.0 \
--no-cache-dir
# clean conda env
RUN conda clean -ya
# change this as you like 🤗
ENV TRANSFORMERS_CACHE=/cos/HF_cache \
HUGGINGFACE_HUB_CACHE=${TRANSFORMERS_CACHE}
FROM conda as app
WORKDIR /src
RUN chmod -R g+w /src
RUN mkdir /.cache && \
chmod -R g+w /.cache
ENV PORT=5000 \
UI_PORT=5001
EXPOSE ${PORT}
EXPOSE ${UI_PORT}
#CMD git clone https://github.com/huggingface/transformers-bloom-inference.git && \
# cd transformers-bloom-inference && \
# # install grpc and compile protos
# make gen-proto && \
# make bloom-560m
I simply comment out the last 5 lines and run them manually inside the container (to avoid repeatedly git cloning the repo when I docker exec into the running instance from another terminal). Specifically, here are my steps to build the image and launch the instance:
git clone https://github.com/huggingface/transformers-bloom-inference transformers-bloom-inference-GPU
cd transformers-bloom-inference-GPU
comment out the last 5 lines of the Dockerfile as mentioned above
docker build -t transformers-bloom:v1.0 .
docker run --gpus all -it --name="bloom" -v /nfs/users/test:/nfs/users/test -w /nfs/users/test transformers-bloom:v1.0
Then, inside the container, I run make bloom-176b, launch the benchmark, and hit the NotImplementedError: Cannot copy out of meta tensor; no data!
git clone https://github.com/huggingface/transformers-bloom-inference
cd transformers-bloom-inference && \
# install grpc and compile protos
make gen-proto && \
make bloom-176b
deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
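One thing that may be worth checking before re-running the 176b benchmark: the log shows Loading 0 checkpoint shards: 0it, so it is possible the BLOOM weight shards are not actually present in the container's cache. A hedged diagnostic sketch, assuming TRANSFORMERS_CACHE=/cos/HF_cache as set in the Dockerfile above and the cache layout shown earlier in the log (models--bigscience--bloom/snapshots/<revision>/...):
# check_bloom_cache.py -- hedged diagnostic: list the BLOOM weight shards in the
# Hugging Face cache; an empty result would match the "Loading 0 checkpoint
# shards" lines in the benchmark log above.
import glob
import os

cache = os.environ.get("TRANSFORMERS_CACHE", "/cos/HF_cache")
pattern = os.path.join(cache, "models--bigscience--bloom", "snapshots", "*", "*.bin")
shards = sorted(glob.glob(pattern))
print(f"found {len(shards)} .bin shards under {cache}")
for shard in shards[:5]:
    print(" ", shard)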
To supplement: I can successfully run the bloom-3b benchmark and get the performance data. First, I add the following target to the Makefile:
bloom-3b:
make ui
TOKENIZERS_PARALLELISM=false \
MODEL_NAME=bigscience/bloom-3b \
MODEL_CLASS=AutoModelForCausalLM \
DEPLOYMENT_FRAMEWORK=ds_inference \
DTYPE=fp16 \
MAX_INPUT_LENGTH=32 \
MAX_BATCH_SIZE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
gunicorn -t 0 -w 1 -b 127.0.0.1:5000 inference_server.server:app --access-logfile - --access-logformat '%(h)s %(t)s "%(r)s" %(s)s %(b)s'
Then, for bloom-3b:
deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom-3b --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
Not sure why 176b is not working. I will try to look into it :)