
Unable to load ragged_device_ops op due to no compute capabilities remaining after filtering

Open · rogerbock opened this issue 1 year ago • 10 comments

I get this error when following the DeepSpeed-FastGen instructions:

from mii import pipeline

# The error below is raised here, while the inference engine is being built
pipe = pipeline("mistralai/Mistral-7B-v0.1")
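For context, the FastGen quick-start would then call the pipeline roughly like this (a sketch of the next step from the instructions, which is never reached in this case):

response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(response)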

The full stack trace is:

Loading extension module inference_core_ops...
Time to load inference_core_ops op: 32.12497305870056 seconds
Installed CUDA version 11.0 does not match the version torch was compiled with 11.1 but since the APIs are compatible, accepting this combination
 [WARNING]  Filtered compute capabilities ['7.0+PTX']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/mii/pipeline.py", line 32, in pipeline
    inference_engine = load_model(model_config)
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/mii/models.py", line 17, in load_model
    inference_engine = build_hf_engine(
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/deepspeed/inference/v2/engine_factory.py", line 46, in build_hf_engine
    return InferenceEngineV2(policy, engine_config)
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/deepspeed/inference/v2/engine_v2.py", line 65, in __init__
    self._model = self._policy.build_model(self._config, self._base_mp_group)
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/deepspeed/inference/v2/model_implementations/inference_policy_base.py", line 110, in build_model
    self.model = self.instantiate_model(engine_config, mp_group)
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/deepspeed/inference/v2/model_implementations/mistral/policy.py", line 23, in instantiate_model
    return MistralInferenceModel(config=self._model_config, engine_config=engine_config, base_mp_group=mp_group)
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/deepspeed/inference/v2/model_implementations/inference_transformer_base.py", line 229, in __init__
    self.make_attn_layer()
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/deepspeed/inference/v2/model_implementations/inference_transformer_base.py", line 346, in make_attn_layer
    self.attn = heuristics.instantiate_attention(attn_config, self._engine_config)
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/deepspeed/inference/v2/modules/heuristics.py", line 53, in instantiate_attention
    return DSSelfAttentionRegistry.instantiate_config(config)
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/deepspeed/inference/v2/modules/module_registry.py", line 39, in instantiate_config
    return cls.registry[config_bundle.name](config_bundle.config, config_bundle.implementation_config)
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/deepspeed/inference/v2/modules/implementations/attention/dense_blocked_attention.py", line 88, in __init__
    self._kv_copy = BlockedRotaryEmbeddings(self._config.head_size, self._config.n_heads_q,
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/deepspeed/inference/v2/kernels/ragged_ops/linear_blocked_kv_rotary/blocked_kv_rotary.py", line 49, in __init__
    inf_module = RaggedOpsBuilder().load()
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 452, in load
    return self.jit_load(verbose)
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 488, in jit_load
    nvcc_args = self.strip_empty_entries(self.nvcc_args())
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 689, in nvcc_args
    args += self.compute_capability_args()
  File "/opt/conda/envs/py39/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 569, in compute_capability_args
    raise RuntimeError(
RuntimeError: Unable to load ragged_device_ops op due to no compute capabilities remaining after filtering

And here is my environment:

$ ds_report
[2023-11-08 19:21:20,365] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/envs/py39/lib/python3.9/site-packages/torch']
torch version .................... 1.10.1+cu111
deepspeed install path ........... ['/opt/conda/envs/py39/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.12.2, unknown, unknown
torch cuda version ............... 11.1
torch hip version ................ None
nvcc version ..................... 11.0
deepspeed wheel compiled w. ...... torch 1.10, cuda 11.1
shared memory (/dev/shm) size .... 25.55 GB

rogerbock · Nov 08 '23 19:11

I suspect the issue is that my CUDA driver is not a high enough version. I just saw the following in the documentation:

We have found this library to be very portable across environments with NVIDIA GPUs with compute capabilities 8.0+ (Ampere+), CUDA 11.6+, and Ubuntu 20+.

rogerbock · Nov 08 '23 19:11

@rogerbock you are correct. You will need a GPU and CUDA version that support compute capability 8.0+.
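For anyone hitting this, you can double-check what your GPU reports with torch (a minimal sketch, assuming torch with CUDA support is installed and a device is visible):

import torch

# Compute capability of the first visible CUDA device; DeepSpeed-FastGen needs 8.0+
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")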

mrwyattii · Nov 08 '23 23:11

Sounds good, thank you for confirming that is the case! Ideally the error message would be improved to communicate that this is the underlying issue. I'll see if I can track down some better GPUs. 🙂

Feel free to close this (or keep it open if you think fixing the error message is feasible).

rogerbock · Nov 09 '23 13:11

@rogerbock you're right, that error message could be better! I'll look into how we can improve it. Would something like this be better?

Unable to load ragged_device_ops op due to no compute capabilities remaining after filtering. Compute capabilities found: {7.0}, Compute capabilities required: 8.0+
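For illustration, the check could surface both values along these lines (a hypothetical sketch, not the actual DeepSpeed code; the function name, message format, and capability parsing are all assumptions):

def validate_compute_capabilities(found_ccs, required="8.0"):
    # Keep only capabilities that meet the requirement; "7.0+PTX" parses as 7.0
    remaining = [cc for cc in found_ccs if float(cc.split("+")[0]) >= float(required)]
    if not remaining:
        raise RuntimeError(
            f"Unable to load ragged_device_ops op due to no compute capabilities "
            f"remaining after filtering. CUDA compute capabilities found: {found_ccs}, "
            f"required: {required}+")

validate_compute_capabilities(["7.0+PTX"])  # raises, reporting both found and required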

mrwyattii · Nov 09 '23 17:11

Yes, that would be great! I would maybe say "CUDA compute capabilities" to be more explicit.

As a point of comparison, this is the error we got from vllm when we ran into a similar incompatibility issue:

ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-SXM2-16GB GPU has compute capability 7.0.

rogerbock · Nov 09 '23 18:11

Thank you for the feedback! We will incorporate your suggestion in an upcoming release. Please keep this issue open for the time being; I will close it once the improvement has landed.

mrwyattii · Nov 09 '23 21:11

So, we cannot run it on an SM 7.5 GPU card, right?

SeekPoint · Nov 15 '23 15:11

So, we cannot run it on an SM 7.5 GPU card, right?

That is correct for the latest MII release. But we still support <sm_80 with MII-Legacy APIs!
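For reference, a minimal MII-Legacy deployment looks roughly like this (a sketch based on the legacy examples; the model and deployment name are placeholders):

import mii

# MII-Legacy deployment path, which still supports pre-Ampere (<sm_80) GPUs
mii.deploy(task="text-generation",
           model="mistralai/Mistral-7B-v0.1",
           deployment_name="mistral-deployment")

generator = mii.mii_query_handle("mistral-deployment")
result = generator.query({"query": ["DeepSpeed is"]}, max_new_tokens=64)
print(result)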

mrwyattii · Nov 15 '23 19:11

I have the same error, but my CUDA version is 11.8.

ds_report
[2024-01-10 20:06:22,681] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/powerop/work/conda/envs/deepspeed/lib/python3.10/site-packages/torch']
torch version .................... 2.1.2+cu118
deepspeed install path ........... ['/home/powerop/work/conda/envs/deepspeed/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.12.6, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.1, cuda 11.8
shared memory (/dev/shm) size .... 45.24 GB

My error is:

line 458, in load
    return self.jit_load(verbose)
  File "/home/powerop/work/conda/envs/deepspeed/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 489, in jit_load
    nvcc_args = self.strip_empty_entries(self.nvcc_args())
  File "/home/powerop/work/conda/envs/deepspeed/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 693, in nvcc_args
    args += self.compute_capability_args()
  File "/home/powerop/work/conda/envs/deepspeed/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 570, in compute_capability_args
    raise RuntimeError(
RuntimeError: Unable to load ragged_device_ops op due to no compute capabilities remaining after filtering

ArlanCooper · Jan 10 '24 12:01

I have the same error, but my CUDA version is 12.1.

My env is:

Sat May 18 12:17:16 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1A:00.0 Off |                    0 |
| N/A   49C    P0    73W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

My error is:

pspeed/inference/v2/kernels/core_ops/cuda_linear/linear_kernels_cuda.cu:15:
/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_mma.cuh:59:2: warning: #warning "The matrix load functions are only supported on Ampere and newer architectures" [-Wcpp]
   59 | #warning "The matrix load functions are only supported on Ampere and newer architectures"
      |  ^~~~~~~
/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_mma.cuh:133:2: warning: #warning "The mma functions are only implemented for Ampere and newer architectures" [-Wcpp]
  133 | #warning "The mma functions are only implemented for Ampere and newer architectures"
      |  ^~~~~~~
In file included from /mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/utils_gmem.cuh:13,
                 from /mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/kernel_matmul.cuh:13,
                 from /mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/linear_kernels_cuda.cu:15:
/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_cp.async.cuh:33:2: warning: #warning "The async copy functions are only supported on Ampere and newer architectures" [-Wcpp]
   33 | #warning "The async copy functions are only supported on Ampere and newer architectures"
      |  ^~~~~~~
/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_cp.async.cuh:43:2: warning: #warning "The async copy functions are only supported on Ampere and newer architectures" [-Wcpp]
   43 | #warning "The async copy functions are only supported on Ampere and newer architectures"
      |  ^~~~~~~
/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_cp.async.cuh:54:2: warning: #warning "The async copy functions are only supported on Ampere and newer architectures" [-Wcpp]
   54 | #warning "The async copy functions are only supported on Ampere and newer architectures"
      |  ^~~~~~~
/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_cp.async.cuh:67:2: warning: #warning "The async copy functions are only supported on Ampere and newer architectures" [-Wcpp]
   67 | #warning "The async copy functions are only supported on Ampere and newer architectures"
      |  ^~~~~~~
In file included from /mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/linear_kernels_cuda.cu:15:
/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/kernel_matmul.cuh:268:2: warning: #warning "The FP6 functions are only available on Ampere GPUs." [-Wcpp]
  268 | #warning "The FP6 functions are only available on Ampere GPUs."
      |  ^~~~~~~
In file included from /mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/utils_core.cuh:14,
                 from /mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/kernel_matmul.cuh:12,
                 from /mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/linear_kernels_cuda.cu:15:
/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_mma.cuh:59:2: warning: #warning "The matrix load functions are only supported on Ampere and newer architectures" [-Wcpp]
   59 | #warning "The matrix load functions are only supported on Ampere and newer architectures"
      |  ^~~~~~~
/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_mma.cuh:133:2: warning: #warning "The mma functions are only implemented for Ampere and newer architectures" [-Wcpp]
  133 | #warning "The mma functions are only implemented for Ampere and newer architectures"
      |  ^~~~~~~
In file included from /mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/utils_gmem.cuh:13,
                 from /mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/kernel_matmul.cuh:13,
                 from /mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/linear_kernels_cuda.cu:15:
/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_cp.async.cuh:33:2: warning: #warning "The async copy functions are only supported on Ampere and newer architectures" [-Wcpp]
   33 | #warning "The async copy functions are only supported on Ampere and newer architectures"
      |  ^~~~~~~
/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_cp.async.cuh:43:2: warning: #warning "The async copy functions are only supported on Ampere and newer architectures" [-Wcpp]
   43 | #warning "The async copy functions are only supported on Ampere and newer architectures"
      |  ^~~~~~~
/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_cp.async.cuh:54:2: warning: #warning "The async copy functions are only supported on Ampere and newer architectures" [-Wcpp]
   54 | #warning "The async copy functions are only supported on Ampere and newer architectures"
      |  ^~~~~~~
/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/ptx_cp.async.cuh:67:2: warning: #warning "The async copy functions are only supported on Ampere and newer architectures" [-Wcpp]
   67 | #warning "The async copy functions are only supported on Ampere and newer architectures"
      |  ^~~~~~~
In file included from /mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/linear_kernels_cuda.cu:15:
/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/core_ops/cuda_linear/include/kernel_matmul.cuh:268:2: warning: #warning "The FP6 functions are only available on Ampere GPUs." [-Wcpp]
  268 | #warning "The FP6 functions are only available on Ampere GPUs."
      |  ^~~~~~~
[12/12] c++ core_ops.o bias_activation.o bias_activation_cuda.cuda.o layer_norm.o layer_norm_cuda.cuda.o rms_norm.o rms_norm_cuda.cuda.o gated_activation_kernels.o gated_activation_kernels_cuda.cuda.o linear_kernels.o linear_kernels_cuda.cuda.o -shared -L/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o inference_core_ops.so
Loading extension module inference_core_ops...
Time to load inference_core_ops op: 159.55809497833252 seconds
 [WARNING]  Filtered compute capabilities ['7.0+PTX']
[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/workspace/demo/test_deepspeed/test_deepspeed.py", line 2, in <module>
[rank0]:     pipe = mii.pipeline("/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/workspace/models_zoo/llama-7b")
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/mii/api.py", line 207, in pipeline
[rank0]:     inference_engine = load_model(model_config)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/mii/modeling/models.py", line 17, in load_model
[rank0]:     inference_engine = build_hf_engine(
[rank0]:                        ^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/engine_factory.py", line 129, in build_hf_engine
[rank0]:     return InferenceEngineV2(policy, engine_config)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/engine_v2.py", line 83, in __init__
[rank0]:     self._model = self._policy.build_model(self._config, self._base_mp_group)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/model_implementations/inference_policy_base.py", line 156, in build_model
[rank0]:     self.model = self.instantiate_model(engine_config, mp_group)
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/model_implementations/llama_v2/policy.py", line 17, in instantiate_model
[rank0]:     return Llama2InferenceModel(config=self._model_config, engine_config=engine_config, base_mp_group=mp_group)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/model_implementations/inference_transformer_base.py", line 217, in __init__
[rank0]:     self.make_attn_layer()
[rank0]:   File "/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/model_implementations/inference_transformer_base.py", line 334, in make_attn_layer
[rank0]:     self.attn = heuristics.instantiate_attention(attn_config, self._engine_config)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/modules/heuristics.py", line 53, in instantiate_attention
[rank0]:     return DSSelfAttentionRegistry.instantiate_config(config)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/modules/module_registry.py", line 39, in instantiate_config
[rank0]:     return cls.registry[config_bundle.name](config_bundle.config, config_bundle.implementation_config)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/modules/implementations/attention/dense_blocked_attention.py", line 100, in __init__
[rank0]:     self._kv_copy = BlockedRotaryEmbeddings(self._config.head_size, self._config.n_heads_q,
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/inference/v2/kernels/ragged_ops/linear_blocked_kv_rotary/blocked_kv_rotary.py", line 50, in __init__
[rank0]:     inf_module = RaggedOpsBuilder().load()
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in load
[rank0]:     return self.jit_load(verbose)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 511, in jit_load
[rank0]:     nvcc_args = self.strip_empty_entries(self.nvcc_args())
[rank0]:                                          ^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 722, in nvcc_args
[rank0]:     args += self.compute_capability_args()
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/bn/pan-personal-bytenas/mlx/users/yinchangpan/software/conda/envs/ycp_py311_mii/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 592, in compute_capability_args
[rank0]:     raise RuntimeError(
[rank0]: RuntimeError: Unable to load ragged_device_ops op due to no compute capabilities remaining after filtering
INFO[0225] Worker 0 Status Failed                        host="fdbd:dc02:16:653::43" message= reason=Error
error: exec command: 0



Greatpanc · May 18 '24 04:05