DeepSpeed
[BUG] fused_adam cannot be installed
Describe the bug
Hi @loadams, I was trying to run the CIFAR training example with deepspeed cifar10_deepspeed.py. I first installed DeepSpeed with DS_BUILD_OPS=1 DS_BUILD_EVOFORMER_ATTN=0 DS_BUILD_SPARSE_ATTN=0 pip install deepspeed --global-option="build_ext" --global-option="-j8".
However, the run failed with the error below. Would you please kindly help me with that? Thanks a lot.
ile "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
model_engine, optimizer, trainloader, __ = deepspeed.initialize(
File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/__init__.py", line 176, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 307, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1230, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1307, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 479, in load
return self.jit_load(verbose)
File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 523, in jit_load
op_module = load(name=self.name,
File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1306, in load
return _jit_compile(
File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1736, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2132, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "<frozen importlib._bootstrap>", line 571, in module_from_spec
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1307, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 479, in load
return self.jit_load(verbose)
File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 523, in jit_load
op_module = load(name=self.name,
File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1306, in load
ImportError: /home/zhongad/.cache/torch_extensions/py310_cu121/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
return _jit_compile(
File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1736, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
File "/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2132, in _import_module_from_library
File "<frozen importlib._bootstrap_external>", line 1176, in create_module
module = importlib.util.module_from_spec(spec)
File "<frozen importlib._bootstrap>", line 571, in module_from_spec
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "<frozen importlib._bootstrap_external>", line 1176, in create_module
ImportError: /home/zhongad/.cache/torch_extensions/py310_cu121/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /home/zhongad/.cache/torch_extensions/py310_cu121/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
[2024-04-04 22:12:45,596] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 1630562
[2024-04-04 22:12:46,240] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 1630563
[2024-04-04 22:12:46,600] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 1630564
[2024-04-04 22:12:46,600] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 1630566
[2024-04-04 22:12:46,623] [ERROR] [launch.py:322:sigkill_handler] ['/home/zhongad/.conda/envs/embodiedgpt/bin/python', '-u', 'cifar10_deepspeed.py', '--local_rank=3'] exits with return code = 1
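For reference, the failing JIT-load path can be exercised in isolation with a minimal sketch like this (my own addition, assuming deepspeed and a CUDA-enabled torch import cleanly), which shows the full compiler/linker output instead of only the final ImportError:

# Minimal isolation sketch (assumption: deepspeed is importable in this env):
# call the same builder the traceback goes through, so the JIT build output is visible.
from deepspeed.ops.op_builder import FusedAdamBuilder

builder = FusedAdamBuilder()
print("compatible:", builder.is_compatible())  # checks nvcc/torch prerequisites
fused_adam = builder.load(verbose=True)        # JIT-builds into ~/.cache/torch_extensions
print("loaded:", fused_adam)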
ds_report output
[2024-04-04 22:12:58,166] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/torch']
torch version .................... 2.2.2+cu121
deepspeed install path ........... ['/home/zhongad/.conda/envs/embodiedgpt/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.14.0, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.2
deepspeed wheel compiled w. ...... torch 2.2, cuda 12.1
shared memory (/dev/shm) size .... 1007.78 GB
System info (please complete the following information):
Output of nvcc -V:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
Thanks @HaFred - it looks like other users have reported this as well. Does this also fail if you just run DS_BUILD_FUSED_ADAM= pip install deepspeed?
No, it doesn't help.
Please consider trying this:
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
DS_BUILD_UTILS=1 DS_BUILD_FUSED_ADAM=1 pip install .
If you encounter a mismatch between CUDA and gcc, consider lowering the gcc version and running it again. I hope this helps.
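If that rebuild goes through, a quick way to confirm the op actually loads is to construct FusedAdam directly, since that hits the same FusedAdamBuilder().load() call as the traceback above. A minimal sketch (the toy model and shapes here are illustrative only):

# Sketch to confirm the rebuilt fused_adam op loads (toy model/shapes are illustrative).
import torch
from deepspeed.ops.adam import FusedAdam

model = torch.nn.Linear(8, 8).cuda()
optimizer = FusedAdam(model.parameters(), lr=1e-3)  # triggers FusedAdamBuilder().load()
loss = model(torch.randn(4, 8, device="cuda")).sum()
loss.backward()
optimizer.step()
print("fused_adam loaded and stepped successfully")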
@HaFred, @iwannabewater, @riyaj8888 - are you able to test with the latest DeepSpeed? This should be resolved if you build from source, as we believe we resolved the issue you were hitting in https://github.com/microsoft/DeepSpeed/pull/5780.
Please re-open if you are still hitting this?