Problem with Deepspeed integration
System Info
transformersversion: 4.29.2- Platform: Linux-5.4.0-137-generic-x86_64-with-glibc2.31
- Python version: 3.11.3
- Huggingface_hub version: 0.15.1
- Safetensors version: 0.3.1
- PyTorch version (GPU?): 2.0.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes
Who can help?
No response
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the
examplesfolder (such as GLUE/SQuAD, ...) - [x] My own task or dataset (give details below)
Reproduction
I am using the WizardCoder training script to further fine-tune the model on some examples that I have using DeepSpeed integration. I have followed their instructions given here to fine-tune the model and I am getting the following error:
datachat_env) [email protected]:~/Custom-LLM$ sh train.sh
[2023-06-23 00:36:25,039] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-23 00:36:25,077] [INFO] [runner.py:541:main] cmd = /root/anaconda3/envs/datachat_env/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None /root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py --model_name_or_path /root/Custom-LLM/WizardCoder-15B-V1.0 --data_path /root/Custom-LLM/data.json --output_dir /root/Custom-LLM/WC-Checkpoint --num_train_epochs 3 --model_max_length 512 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 50 --save_total_limit 2 --learning_rate 2e-5 --warmup_steps 30 --logging_steps 2 --lr_scheduler_type cosine --report_to tensorboard --gradient_checkpointing True --deepspeed /root/Custom-LLM/Llama-X/src/configs/deepspeed_config.json --fp16 True
[2023-06-23 00:36:26,992] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-06-23 00:36:26,993] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-06-23 00:36:26,993] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-06-23 00:36:26,993] [INFO] [launch.py:247:main] dist_world_size=4
[2023-06-23 00:36:26,993] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-06-23 00:36:29,650] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-06-23 00:36:55,124] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 15.82B parameters
[2023-06-23 00:37:12,845] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-06-23 00:37:12,968] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-06-23 00:37:12,969] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-06-23 00:37:12,970] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
FAILED: cpu_adam.o
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
In file included from /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes/cpu_adam.h:19,
from /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp:6:
/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:12:10: fatal error: curand_kernel.h: No such file or directory
12 | #include <curand_kernel.h>
| ^~~~~~~~~~~~~~~~~
compilation terminated.
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
[2/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
FAILED: custom_cuda_kernel.cuda.o
/usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
In file included from /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu:6:
/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:12:10: fatal error: curand_kernel.h: No such file or directory
12 | #include <curand_kernel.h>
| ^~~~~~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/root/anaconda3/envs/datachat_env/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 247, in <module>
train()
File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 241, in train
trainer.train()
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1664, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1741, in _inner_training_loop
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/__init__.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1218, in _configure_basic_optimizer
optimizer = DeepSpeedCPUAdam(model_parameters,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
self.ds_opt_adam = CPUAdamBuilder().load()
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
return self.jit_load(verbose)
^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
Loading extension module cpu_adam...
op_module = load(name=self.name,
^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
Traceback (most recent call last):
File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 247, in <module>
return _jit_compile(
^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
train()
File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 241, in train
trainer.train()
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1664, in train
_write_ninja_file_and_build_library(
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1741, in _inner_training_loop
raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'
Loading extension module cpu_adam...
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
Traceback (most recent call last):
File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 247, in <module>
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
train()
File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 241, in train
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
trainer.train() File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/__init__.py", line 165, in initialize
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1664, in train
engine = DeepSpeedEngine(args=args,
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1741, in _inner_training_loop
basic_optimizer = self._configure_basic_optimizer(model_parameters)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1218, in _configure_basic_optimizer
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
^^ ^optimizer = DeepSpeedCPUAdam(model_parameters,^
^^ ^ ^ ^ ^ ^ ^ ^ ^ ^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
self.ds_opt_adam = CPUAdamBuilder().load()
^ ^ ^ ^ ^ ^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^ File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/__init__.py", line 165, in initialize
return self.jit_load(verbose)
engine = DeepSpeedEngine(args=args,
^ ^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
self._configure_optimizer(optimizer, model_parameters)
op_module = load(name=self.name,
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
basic_optimizer = self._configure_basic_optimizer(model_parameters)
^^^^^^^^^^^^^ ^return _jit_compile(^
^^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^ File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1218, in _configure_basic_optimizer
optimizer = DeepSpeedCPUAdam(model_parameters,
return _import_module_from_library(name, build_directory, is_python_module)
^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^ File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
^^^^^^^^^^^^^^^^^^^^^^^^ ^self.ds_opt_adam = CPUAdamBuilder().load()^
^^^^^^^^^ ^ ^ ^ ^ ^ ^ ^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
return self.jit_load(verbose)
^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
module = importlib.util.module_from_spec(spec)
^^^^^^^^^^^^^^^^ ^op_module = load(name=self.name,^
^^^^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
^^ File "<frozen importlib._bootstrap>", line 573, in module_from_spec
^^ File "<frozen importlib._bootstrap_external>", line 1233, in create_module
^^ File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
^^ImportError^: ^/root/.cache/torch_extensions/py311_cu118/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory^
^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 573, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1233, in create_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py311_cu118/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
Loading extension module cpu_adam...
Traceback (most recent call last):
File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 247, in <module>
train()
File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 241, in train
trainer.train()
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1664, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1741, in _inner_training_loop
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/__init__.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1218, in _configure_basic_optimizer
optimizer = DeepSpeedCPUAdam(model_parameters,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
self.ds_opt_adam = CPUAdamBuilder().load()
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
return self.jit_load(verbose)
^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
op_module = load(name=self.name,
^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 573, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1233, in create_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py311_cu118/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fcaec4a89a0>
Traceback (most recent call last):
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
self.ds_opt_adam.destroy_adam(self.opt_id)
^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fbf4e6409a0>
Traceback (most recent call last):
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
self.ds_opt_adam.destroy_adam(self.opt_id)
^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f9ce61b09a0>
Traceback (most recent call last):
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f6c2bf109a0>
Traceback (most recent call last):
File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
self.ds_opt_adam.destroy_adam(self.opt_id)
^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Expected behavior
Expect the model to use the deepspeed config file and run training
cc @pacman100
Hello, this isn't an issue with DeepSpeed integration. The issue is this:
ImportError: /root/.cache/torch_extensions/py311_cu118/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
...
RuntimeError: Error building extension 'cpu_adam'
Hi, @karths8
You can try rm -rf ~/.cache/torch_extensions/ first.
Related discussion: #14520
rm -rf ~/.cache/torch_extensions/
This does not seem to work for me. The root of the problem lies in fatal error: curand_kernel.h: No such file or directory. If there are any insights on how to solve this issue please let me know. Any help is greatly appreciated!
This isn't an integration issue like pacman100 said. See this: https://github.com/microsoft/DeepSpeed/issues/1846 Looks like an issue with the DeepSpeed pip package, I recommend installing it via conda
This isn't an integration issue like pacman100 said. See this: microsoft/DeepSpeed#1846 Looks like an issue with the DeepSpeed pip package, I recommend installing it via conda
Thanks! I fixed it using this