Problem with Deepspeed integration

Open karths8 opened this issue 2 years ago • 4 comments

System Info

  • transformers version: 4.29.2
  • Platform: Linux-5.4.0-137-generic-x86_64-with-glibc2.31
  • Python version: 3.11.3
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

Who can help?

No response

Information

  • [x] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [x] My own task or dataset (give details below)

Reproduction

I am using the WizardCoder training script with the DeepSpeed integration to further fine-tune the model on some examples of my own. I followed their instructions (given here) to fine-tune the model, and I am getting the following error:

(datachat_env) [email protected]:~/Custom-LLM$ sh train.sh
[2023-06-23 00:36:25,039] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-06-23 00:36:25,077] [INFO] [runner.py:541:main] cmd = /root/anaconda3/envs/datachat_env/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None /root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py --model_name_or_path /root/Custom-LLM/WizardCoder-15B-V1.0 --data_path /root/Custom-LLM/data.json --output_dir /root/Custom-LLM/WC-Checkpoint --num_train_epochs 3 --model_max_length 512 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 50 --save_total_limit 2 --learning_rate 2e-5 --warmup_steps 30 --logging_steps 2 --lr_scheduler_type cosine --report_to tensorboard --gradient_checkpointing True --deepspeed /root/Custom-LLM/Llama-X/src/configs/deepspeed_config.json --fp16 True
[2023-06-23 00:36:26,992] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-06-23 00:36:26,993] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-06-23 00:36:26,993] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-06-23 00:36:26,993] [INFO] [launch.py:247:main] dist_world_size=4
[2023-06-23 00:36:26,993] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-06-23 00:36:29,650] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-06-23 00:36:55,124] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 15.82B parameters
[2023-06-23 00:37:12,845] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-06-23 00:37:12,968] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-06-23 00:37:12,969] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
[2023-06-23 00:37:12,970] [WARNING] [cpu_adam.py:84:__init__] FP16 params for CPUAdam may not work on AMD CPUs
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
FAILED: cpu_adam.o 
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
In file included from /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes/cpu_adam.h:19,
                 from /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp:6:
/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:12:10: fatal error: curand_kernel.h: No such file or directory
   12 | #include <curand_kernel.h>
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
[2/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
FAILED: custom_cuda_kernel.cuda.o 
/usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/TH -isystem /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/anaconda3/envs/datachat_env/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -c /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
In file included from /root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu:6:
/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:12:10: fatal error: curand_kernel.h: No such file or directory
   12 | #include <curand_kernel.h>
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 247, in <module>
    train()
  File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 241, in train
    trainer.train()
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1741, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
                                                ^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1218, in _configure_basic_optimizer
    optimizer = DeepSpeedCPUAdam(model_parameters,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
    self.ds_opt_adam = CPUAdamBuilder().load()
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
Loading extension module cpu_adam...    
op_module = load(name=self.name,
                ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
Traceback (most recent call last):
  File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 247, in <module>
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    train()
  File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 241, in train
    trainer.train()
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1664, in train
    _write_ninja_file_and_build_library(
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1741, in _inner_training_loop
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'
Loading extension module cpu_adam...
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
                                                ^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
Traceback (most recent call last):
  File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 247, in <module>
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
                         train() 
               File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 241, in train
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    trainer.train()  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/__init__.py", line 165, in initialize

  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1664, in train
    engine = DeepSpeedEngine(args=args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1741, in _inner_training_loop
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1218, in _configure_basic_optimizer
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
                                                ^^    ^optimizer = DeepSpeedCPUAdam(model_parameters,^
^^ ^ ^ ^ ^ ^ ^ ^ ^ ^ 
    File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
                self.ds_opt_adam = CPUAdamBuilder().load() 
                                                       ^ ^ ^ ^ ^ ^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/__init__.py", line 165, in initialize
    return self.jit_load(verbose)    
engine = DeepSpeedEngine(args=args,
                      ^ ^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
    op_module = load(name=self.name,
        File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
          ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
                      ^^^^^^^^^^^^^    ^return _jit_compile(^
^^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1218, in _configure_basic_optimizer
    optimizer = DeepSpeedCPUAdam(model_parameters,
       return _import_module_from_library(name, build_directory, is_python_module) 
             ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
^^^^^^^^^^^^^^^^^^^^^^^^    ^self.ds_opt_adam = CPUAdamBuilder().load()^
^^^^^^^^^ ^ ^ ^ ^ ^ ^ ^ 
    File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    module = importlib.util.module_from_spec(spec)
             ^^^^^^^^^^^^^^^^    ^op_module = load(name=self.name,^
^^^^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ 
^^  File "<frozen importlib._bootstrap>", line 573, in module_from_spec
^^  File "<frozen importlib._bootstrap_external>", line 1233, in create_module
^^  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
^^ImportError^: ^/root/.cache/torch_extensions/py311_cu118/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory^
^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 573, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1233, in create_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py311_cu118/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
Loading extension module cpu_adam...
Traceback (most recent call last):
  File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 247, in <module>
    train()
  File "/root/Custom-LLM/WizardLM/WizardCoder/src/train_wizardcoder.py", line 241, in train
    trainer.train()
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/trainer.py", line 1741, in _inner_training_loop
    deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
                                                ^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/transformers/deepspeed.py", line 378, in deepspeed_init
    deepspeed_engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1218, in _configure_basic_optimizer
    optimizer = DeepSpeedCPUAdam(model_parameters,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
    self.ds_opt_adam = CPUAdamBuilder().load()
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
                ^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 573, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1233, in create_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py311_cu118/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fcaec4a89a0>
Traceback (most recent call last):
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
    ^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fbf4e6409a0>
Traceback (most recent call last):
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
    ^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f9ce61b09a0>
Traceback (most recent call last):
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7f6c2bf109a0>
Traceback (most recent call last):
  File "/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/adam/cpu_adam.py", line 102, in __del__
    self.ds_opt_adam.destroy_adam(self.opt_id)
    ^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'

Expected behavior

I expect the model to pick up the DeepSpeed config file and run training.

karths8, Jun 23 '23 00:06

cc @pacman100

sgugger, Jun 23 '23 12:06

Hello, this isn't an issue with the DeepSpeed integration. The issue is this:

ImportError: /root/.cache/torch_extensions/py311_cu118/cpu_adam/cpu_adam.so: cannot open shared object file: No such file or directory
...

RuntimeError: Error building extension 'cpu_adam'
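
Both messages are raised after the JIT build of cpu_adam has already failed, so the compiler output earlier in the log is the place to look. A minimal sketch, assuming the log is available as text, of pulling the compiler's fatal errors out of interleaved multi-rank output. The regex and the function name are illustrative and not part of DeepSpeed or PyTorch:

```python
import re

# Matches compiler diagnostics of the form
# "<file>:<line>:<col>: fatal error: <message>"
FATAL_RE = re.compile(
    r"^(?P<file>\S+?):(?P<line>\d+):(?P<col>\d+): fatal error: (?P<msg>.+)$"
)


def first_fatal_errors(log_text):
    """Return unique (file, line, message) tuples in order of appearance."""
    seen = []
    for raw in log_text.splitlines():
        match = FATAL_RE.match(raw.strip())
        if match:
            item = (match.group("file"), int(match.group("line")), match.group("msg"))
            if item not in seen:  # several ranks usually report the same error
                seen.append(item)
    return seen


# Excerpt copied from the log in this issue.
sample = """\
FAILED: cpu_adam.o
/root/anaconda3/envs/datachat_env/lib/python3.11/site-packages/deepspeed/ops/csrc/includes/custom_cuda_layers.h:12:10: fatal error: curand_kernel.h: No such file or directory
compilation terminated.
RuntimeError: Error building extension 'cpu_adam'
"""

for path, line, msg in first_fatal_errors(sample):
    print(f"{path}:{line}: {msg}")
```

On the excerpt above this reports the missing curand_kernel.h include, which comes before both the RuntimeError and the ImportError.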

pacman100, Jun 23 '23 12:06

Hi, @karths8

You can try rm -rf ~/.cache/torch_extensions/ first.

Related discussion: #14520
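
If removing the whole cache feels too broad, the same cleanup can be limited to one extension. A rough sketch, assuming the cache lives under TORCH_EXTENSIONS_DIR when that variable is set and under ~/.cache/torch_extensions otherwise (this mirrors the paths in the log, and other platforms may differ). The helper names are hypothetical:

```python
import os
import shutil
from pathlib import Path


def torch_extensions_root():
    """Best guess at the JIT build cache root (assumption: Linux defaults)."""
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    if override:
        return Path(override)
    cache_home = os.environ.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(cache_home) / "torch_extensions"


def clear_extension_cache(name="cpu_adam"):
    """Delete cached build directories called `name`; return what was removed."""
    root = torch_extensions_root()
    if not root.is_dir():
        return []
    # Look directly under the root and one level down (e.g. py311_cu118/cpu_adam).
    candidates = list(root.glob(name)) + list(root.glob(f"*/{name}"))
    targets = [path for path in candidates if path.is_dir()]
    for target in targets:
        shutil.rmtree(target)
    return targets
```

Calling clear_extension_cache("cpu_adam") removes only the cpu_adam build directory, so the next run rebuilds that op from scratch. This only helps when the cache is stale; it cannot fix a build that fails to compile.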

ydshieh, Jun 27 '23 08:06

rm -rf ~/.cache/torch_extensions/

This does not seem to work for me. The root of the problem is the compiler error "fatal error: curand_kernel.h: No such file or directory". If anyone has insights on how to solve this, please let me know. Any help is greatly appreciated!
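
A quick way to check whether the header is present where the compiler looks for it. This is only a sketch: the build log passes -I/usr/local/cuda/include, and the CUDA_HOME and CUDA_PATH environment variables are included here as further assumed locations. The function names are made up for illustration:

```python
import os
from pathlib import Path


def candidate_include_dirs():
    """Directories that commonly hold the CUDA headers (not exhaustive)."""
    dirs = []
    for var in ("CUDA_HOME", "CUDA_PATH"):  # assumed to point at a CUDA toolkit root
        value = os.environ.get(var)
        if value:
            dirs.append(Path(value) / "include")
    dirs.append(Path("/usr/local/cuda/include"))  # the -I path from the build log
    return dirs


def find_header(header="curand_kernel.h", include_dirs=None):
    """Return every location where `header` exists as a regular file."""
    if include_dirs is None:
        dirs = candidate_include_dirs()
    else:
        dirs = [Path(d) for d in include_dirs]
    return [d / header for d in dirs if (d / header).is_file()]


hits = find_header()
if hits:
    print("found:", ", ".join(str(h) for h in hits))
else:
    print("curand_kernel.h not found in the searched include directories")
```

If nothing is found, the CUDA installation that nvcc and c++ are pointed at is probably missing the cuRAND development headers, which would match the error above.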

karths8, Jun 29 '23 14:06

As pacman100 said, this isn't an integration issue. See https://github.com/microsoft/DeepSpeed/issues/1846. It looks like an issue with the DeepSpeed pip package; I recommend installing it via conda.
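
One way to confirm that a reinstall worked, without launching the full training job, is to trigger the same op build that fails in the traceback (CPUAdamBuilder().load()). A hedged sketch: the import path deepspeed.ops.op_builder is an assumption based on the file paths in the traceback, and try_build_op is a hypothetical helper name:

```python
import importlib


def try_build_op(module_name="deepspeed.ops.op_builder", builder_name="CPUAdamBuilder"):
    """Try to JIT-build one DeepSpeed op and return a short status string."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        return f"not installed: {exc}"
    builder_cls = getattr(module, builder_name, None)
    if builder_cls is None:
        return f"builder not found: {builder_name}"
    try:
        builder_cls().load()  # the call that raises in the traceback
    except Exception as exc:  # the traceback shows both RuntimeError and ImportError here
        return f"build failed: {exc}"
    return "ok"
```

Running print(try_build_op()) inside the training environment should print "ok" once the op compiles. A "build failed" result reproduces the problem in a few seconds, without loading the 15B model.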

orangetin, Jul 02 '23 07:07

As pacman100 said, this isn't an integration issue. See microsoft/DeepSpeed#1846. It looks like an issue with the DeepSpeed pip package; I recommend installing it via conda.

Thanks! I fixed it using this.

karths8, Jul 02 '23 17:07