
[BUG] fused_adam cannot be installed

Open HaFred opened this issue 10 months ago • 2 comments

Describe the bug: Hi @loadams, I was trying to run the CIFAR training example with deepspeed cifar10_deepspeed.py. I first installed DeepSpeed with DS_BUILD_OPS=1 DS_BUILD_EVOFORMER_ATTN=0 DS_BUILD_SPARSE_ATTN=0 pip install deepspeed --global-option="build_ext" --global-option="-j8".

However, the run failed with the error below. Would you please kindly help me with that? Thanks a lot.

  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
    model_engine, optimizer, trainloader, __ = deepspeed.initialize(
  File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/__init__.py", line 176, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 307, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1230, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1307, in _configure_basic_optimizer
    optimizer = FusedAdam(
  File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 479, in load
    return self.jit_load(verbose)
  File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 523, in jit_load
    op_module = load(name=self.name,
  File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1306, in load
    return _jit_compile(
  File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1736, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2132, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1307, in _configure_basic_optimizer
    optimizer = FusedAdam(
  File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 479, in load
    return self.jit_load(verbose)
  File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 523, in jit_load
    op_module = load(name=self.name,
  File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1306, in load
ImportError: /home/zhongad/.cache/torch_extensions/py310_cu121/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
    return _jit_compile(
  File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1736, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2132, in _import_module_from_library
  File "<frozen importlib._bootstrap_external>", line 1176, in create_module
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 1176, in create_module
ImportError: /home/zhongad/.cache/torch_extensions/py310_cu121/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /home/zhongad/.cache/torch_extensions/py310_cu121/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
[2024-04-04 22:12:45,596] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 1630562
[2024-04-04 22:12:46,240] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 1630563
[2024-04-04 22:12:46,600] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 1630564
[2024-04-04 22:12:46,600] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 1630566
[2024-04-04 22:12:46,623] [ERROR] [launch.py:322:sigkill_handler] ['/home/zhongad/.conda/envs/embodiedgpt/bin/python', '-u', 'cifar10_deepspeed.py', '--local_rank=3'] exits with return code = 1
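
The ImportError above says the JIT build directory exists in the torch extensions cache but the compiled fused_adam.so inside it does not. A minimal sketch for checking that, assuming the cache lives under TORCH_EXTENSIONS_DIR when that variable is set and under ~/.cache/torch_extensions otherwise (the location seen in the traceback). The helper name find_built_op is hypothetical and is not part of DeepSpeed or PyTorch:

```python
import os
from pathlib import Path


def find_built_op(op_name, extensions_root=None):
    """Return every compiled <op_name>.so found under the extensions cache.

    extensions_root is an assumption-driven default: TORCH_EXTENSIONS_DIR if
    set, else ~/.cache/torch_extensions (the path shown in the traceback).
    """
    root = Path(
        extensions_root
        or os.environ.get("TORCH_EXTENSIONS_DIR")
        or Path.home() / ".cache" / "torch_extensions"
    )
    if not root.is_dir():
        return []
    # "**" also matches zero directories, so this covers both
    # <root>/<op>/<op>.so and <root>/py310_cu121/<op>/<op>.so.
    return sorted(root.glob(f"**/{op_name}/{op_name}.so"))


if __name__ == "__main__":
    hits = find_built_op("fused_adam")
    print(hits if hits else "no compiled fused_adam.so found in the cache")
```

An empty result is consistent with the error in this report: the op was neither pre-built into the installed package nor successfully JIT compiled into the cache.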

ds_report output

[2024-04-04 22:12:58,166] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/torch']
torch version .................... 2.2.2+cu121
deepspeed install path ........... ['/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.14.0, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.2
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 1007.78 GB
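
In the report above, fused_adam is listed as not installed but compatible, so the op was not pre-built and is expected to be JIT compiled at runtime, even though the install command set DS_BUILD_OPS=1. A small sketch that pulls those two columns out of pasted ds_report text. It assumes plain text without terminal colour codes, as pasted here, and assumes a pre-built op is marked [YES] in the installed column, which this report does not show. The function name parse_op_report is hypothetical:

```python
import re

# Matches rows such as "fused_adam ............. [NO] ....... [OKAY]".
ROW = re.compile(r"^(\w+)\s*\.+\s*\[(\w+)\]\s*\.+\s*\[(\w+)\]\s*$")


def parse_op_report(text):
    """Map op name -> (installed, compatible) from ds_report text."""
    ops = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            name, installed, compatible = match.groups()
            # Assumption: "[YES]" marks a pre-built op; "[OKAY]" marks
            # an op that can be JIT compiled on this system.
            ops[name] = (installed == "YES", compatible == "OKAY")
    return ops


if __name__ == "__main__":
    sample = """
    ninja .................. [OKAY]
    async_io ............... [NO] ....... [NO]
    fused_adam ............. [NO] ....... [OKAY]
    """
    for op, (installed, compatible) in parse_op_report(sample).items():
        print(f"{op}: installed={installed}, compatible={compatible}")
```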

System info (please complete the following information): Output of nvcc -V:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
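
The nvcc -V output reports release 12.3, while ds_report lists nvcc 12.2 and a torch CUDA version of 12.1. A short sketch for extracting and comparing these versions from pasted text. Whether a minor-version difference contributes to this failure is not established in this thread, so the comparison only labels the difference. The function names are hypothetical:

```python
import re


def nvcc_release(nvcc_output):
    """Extract (major, minor) from `nvcc -V` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def compare_cuda(nvcc, torch_cuda):
    """Label the difference between two (major, minor) CUDA versions."""
    if nvcc[0] != torch_cuda[0]:
        return "major mismatch"
    if nvcc[1] != torch_cuda[1]:
        return "minor mismatch"
    return "match"


if __name__ == "__main__":
    release = nvcc_release("Cuda compilation tools, release 12.3, V12.3.52")
    # (12, 1) is the "torch cuda version" from the ds_report output.
    print(release, compare_cuda(release, (12, 1)))
```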

HaFred, Apr 04 '24 14:04

Thanks @HaFred - it looks like other users have reported this as well. Does it also fail if you just run DS_BUILD_FUSED_ADAM= pip install deepspeed?

loadams, Apr 11 '24 22:04

No, it doesn't help.

riyaj8888, Apr 30 '24 09:04

Please consider trying this:

git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
DS_BUILD_UTILS=1 DS_BUILD_FUSED_ADAM=1 pip install .

If you encounter a mismatch between the CUDA and gcc versions, consider switching to a lower gcc version and running the install again. I hope this helps.

iwannabewater, Jul 16 '24 05:07

@HaFred, @iwannabewater, @riyaj8888 - are you able to test with the latest DeepSpeed? This should be resolved if you build from source, as we believe we fixed the issue you were hitting in https://github.com/microsoft/DeepSpeed/pull/5780.

Please re-open if you are still hitting this.

loadams, Aug 14 '24 21:08