bing_bert script error

Open jeyblu opened this issue 2 years ago • 4 comments

An error occurred while running bing_bert/ds_train_bert_nvidia_data_bsz64k_seq128.sh:

Detected CUDA files, patching ldflags
Emitting ninja build file /home/bduser/.cache/torch_extensions/py38_cu114/fused_lamb/build.ninja...
Building extension module fused_lamb...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_lamb -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -isystem /home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/include -isystem /home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/include/TH -isystem /home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/bduser/anaconda3/envs/deepspeed/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -std=c++14 -c /home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/ops/csrc/lamb/fused_lamb_cuda_kernel.cu -o fused_lamb_cuda_kernel.cuda.o
FAILED: fused_lamb_cuda_kernel.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_lamb -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -isystem /home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/include -isystem /home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/include/TH -isystem /home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/bduser/anaconda3/envs/deepspeed/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -lineinfo -O3 --use_fast_math -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -std=c++14 -c /home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/ops/csrc/lamb/fused_lamb_cuda_kernel.cu -o fused_lamb_cuda_kernel.cuda.o
/home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/ops/csrc/lamb/fused_lamb_cuda_kernel.cu(467): error: identifier "THCudaCheck" is undefined

/home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/ops/csrc/lamb/fused_lamb_cuda_kernel.cu(351): warning: variable "threads" was declared but never referenced

1 error detected in the compilation of "/home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/ops/csrc/lamb/fused_lamb_cuda_kernel.cu".
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1718, in _run_ninja_build
    subprocess.run(
  File "/home/bduser/anaconda3/envs/deepspeed/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/bduser/src/deepspeedexamples/bing_bert/deepspeed_train.py", line 600, in <module>
    main()
  File "/home/bduser/src/deepspeedexamples/bing_bert/deepspeed_train.py", line 589, in main
    model, optimizer = prepare_model_optimizer(args)
  File "/home/bduser/src/deepspeedexamples/bing_bert/deepspeed_train.py", line 468, in prepare_model_optimizer
    model.network, optimizer, _, _ = deepspeed.initialize(
  File "/home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/__init__.py", line 131, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 293, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1093, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1193, in _configure_basic_optimizer
    optimizer = FusedLamb(model_parameters, **optimizer_parameters)
  File "/home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/ops/lamb/fused_lamb.py", line 51, in __init__
    self.fused_lamb_cuda = FusedLambBuilder().load()
  File "/home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 367, in load
    return self.jit_load(verbose)
  File "/home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 399, in jit_load
    op_module = load(
  File "/home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1125, in load
    return _jit_compile(
  File "/home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1338, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1450, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/bduser/anaconda3/envs/deepspeed/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1734, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'fused_lamb'

jeyblu, Nov 21 '21 23:11