[Installation]: Installing v0.6.6/v0.6.7 on AMD GPU gfx906 fails; v0.6.5 installs but cannot run GPTQ
Your current environment
Hello,
I can install 0.6.2.post1 and 0.6.5 with ROCm 6.2.2 successfully on my PC, but I run into issues when serving QwQ-32B AWQ, so I wanted to build the newest v0.6.7. That build fails: I have ROCm installed, yet the build script still tries to find CUDA headers. The way I set up and invoke the build is sketched below, followed by the key error and then the full build output.
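For reference, this is roughly how the environment is prepared before building. The ROCM_HOME and PYTORCH_ROCM_ARCH exports are my assumptions about what the build picks up (the log does report "Building PyTorch for GPU arch: gfx906"); they are not switches I have confirmed in Aphrodite's setup.py:

# conda env "aphroditenew", Python 3.10.16, ROCm build of PyTorch 2.6.0
export ROCM_HOME=/opt/rocm-6.2.2        # assumption: where the build should look for ROCm
export PYTORCH_ROCM_ARCH=gfx906         # assumption: how the target arch gets selected
cd ~/aphrodite-engine-new
python3 setup.py develop                # fails with the error below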
2 warnings generated when compiling for gfx906.
In file included from /root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:3,
                 from /root/aphrodite-engine-new/kernels/flash_attn/flash_api.h:10,
                 from /root/aphrodite-engine-new/kernels/torch_bindings.cpp:6:
/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContextLight.h:6:10: fatal error: cuda_runtime_api.h: No such file or directory
    6 | #include <cuda_runtime_api.h>
      |          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
(aphroditenew) root@epyc:~/aphrodite-engine-new# python3 setup.py develop
running develop
/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/setuptools/command/develop.py:41: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!
********************************************************************************
Please avoid running ``setup.py`` and ``easy_install``.
Instead, use pypa/build, pypa/installer or other
standards-based tools.
See https://github.com/pypa/setuptools/issues/917 for details.
********************************************************************************
!!
        easy_install.initialize_options(self)
/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/setuptools/_distutils/cmd.py:79: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!
********************************************************************************
Please avoid running ``setup.py`` directly.
Instead, use pypa/build, pypa/installer or other
standards-based tools.
See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
********************************************************************************
!!
        self.initialize_options()
running egg_info
writing aphrodite_engine.egg-info/PKG-INFO
writing dependency_links to aphrodite_engine.egg-info/dependency_links.txt
writing entry points to aphrodite_engine.egg-info/entry_points.txt
writing requirements to aphrodite_engine.egg-info/requires.txt
writing top-level names to aphrodite_engine.egg-info/top_level.txt
reading manifest file 'aphrodite_engine.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE'
writing manifest file 'aphrodite_engine.egg-info/SOURCES.txt'
running build_ext
Using 64 CPUs as the number of jobs.
-- The CXX compiler identification is GNU 12.3.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Build type: RelWithDebInfo
-- Target device: cuda
-- Found Python: /root/miniconda3/envs/aphroditenew/bin/python3 (found version "3.10.16") found components: Interpreter Development.Module Development.SABIModule
-- Found python matching: /root/miniconda3/envs/aphroditenew/bin/python3.
Building PyTorch for GPU arch: gfx906
-- Found HIP: /opt/rocm-6.2.2 (found suitable version "6.2.41134-65d174c3e", minimum required is "1.0")
HIP VERSION: 6.2.41134-65d174c3e
***** ROCm version from rocm_version.h ****
ROCM_VERSION_DEV: 6.2.2
ROCM_VERSION_DEV_MAJOR: 6
ROCM_VERSION_DEV_MINOR: 2
ROCM_VERSION_DEV_PATCH: 2
ROCM_VERSION_DEV_INT: 60202
HIP_VERSION_MAJOR: 6
HIP_VERSION_MINOR: 2
TORCH_HIP_VERSION: 602
***** Library versions from cmake find_package *****
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
hip VERSION: 6.2.41134
hsa-runtime64 VERSION: 1.14.60202
amd_comgr VERSION: 2.8.0
rocrand VERSION: 3.1.0
hiprand VERSION: 2.11.0
rocblas VERSION: 4.2.1
hipblas VERSION: 2.2.0
hipblaslt VERSION: 0.8.0
miopen VERSION: 3.2.0
hipfft VERSION: 1.0.15
hipsparse VERSION: 3.1.1
rccl VERSION: 2.20.5
rocprim VERSION: 3.2.0
hipcub VERSION: 3.2.0
rocthrust VERSION: 3.1.0
hipsolver VERSION: 2.2.0
CMake Deprecation Warning at /opt/rocm/lib/cmake/hiprtc/hiprtc-config.cmake:21 (cmake_minimum_required):
  Compatibility with CMake < 3.10 will be removed from a future version of CMake.
  Update the VERSION argument
hiprtc VERSION: 6.2.41134
HIP is using new type enums
CMake Warning at /root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
  static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
  /root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:121 (append_torchlib_if_found)
  CMakeLists.txt:70 (find_package)
-- Found Torch: /root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/lib/libtorch.so
-- Enabling core extension.
-- The HIP compiler identification is Clang 18.0.0
-- Detecting HIP compiler ABI info
-- Detecting HIP compiler ABI info - done
-- Check for working HIP compiler: /opt/rocm-6.2.2/lib/llvm/bin/clang++ - skipped
-- Detecting HIP compile features
-- Detecting HIP compile features - done
CMake Warning at CMakeLists.txt:154 (message):
  Pytorch version >= 2.5.0 expected for ROCm build, saw 2.6.0 instead.
-- HIP supported arches: gfx906;gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1030;gfx1100;gfx1101
-- HIP target arches: gfx906;gfx906
-- Enabling C extension.
-- Enabling moe extension.
-- Enabling rocm extension.
-- Configuring done (8.9s)
-- Generating done (0.0s)
-- Build files have been written to: /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310
Using 64 CPUs as the number of jobs.
[ 50%] Building CXX object CMakeFiles/_core_C.dir/kernels/core/torch_bindings.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[100%] Linking CXX shared module /root/aphrodite-engine-new/build/lib.linux-x86_64-cpython-310/aphrodite/_core_C.abi3.so
[100%] Built target _core_C
[ 25%] Running hipify on _moe_C extension source files.
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/cuda_compat.h -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/hip_compat.h [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/moe/softmax.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/moe/softmax.hip [ok]
Successfully preprocessed all matching files.
Total number of unsupported CUDA function calls: 0
Total number of replaced kernel launches: 3 /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/moe/softmax.hip [ 25%] Built target hipify_moe_C [ 50%] Building HIP object CMakeFiles/_moe_C.dir/kernels/moe/softmax.hip.o [ 75%] Building CXX object CMakeFiles/_moe_C.dir/kernels/moe/torch_bindings.cpp.o cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++ [100%] Linking HIP shared module /root/aphrodite-engine-new/build/lib.linux-x86_64-cpython-310/aphrodite/_moe_C.abi3.so [100%] Built target _moe_C [ 6%] Running hipify on _C extension source files. /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/cuda_compat.h -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/hip_compat.h [skipped, already hipified] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/dispatch_utils.h -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/dispatch_utils.h [skipped, no changes] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/amd/hip_float8_impl.h -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/amd/hip_float8_impl.h [skipped, no changes] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/amd/hip_float8.h -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/amd/hip_float8.h [skipped, no changes] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/attention_generic.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/attention_generic.cuh [skipped, no changes] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/dtype_fp8.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/dtype_fp8.cuh [skipped, no changes] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/dtype_float32.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/dtype_float32.cuh [skipped, no changes] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/dtype_bfloat16.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/dtype_bfloat16_hip.cuh [ok] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/amd/quant_utils.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/amd/quant_utils_hip.cuh [ok] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/dtype_float16.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/dtype_float16.cuh [skipped, no changes] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/attention_dtypes.h -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/attention_dtypes_hip.h [ok] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/nvidia/quant_utils.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/nvidia/quant_utils_hip.cuh [ok] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/cache_kernels.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/cache_kernels.hip [ok] 
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/attention_utils.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/attention_utils_hip.cuh [ok] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/attention_kernels.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/attention_kernels.hip [ok] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/pos_encoding_kernels.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/pos_encoding_kernels.hip [ok] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/activation_kernels.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/activation_kernels.hip [ok] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/layernorm_kernels.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/layernorm_kernels.hip [ok] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/squeezellm/quant_cuda_kernel.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/squeezellm/quant_hip_kernel.hip [ok] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/compat.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/compat.cuh [skipped, no changes] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_util.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_util.cuh [skipped, no changes] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/matrix_view.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/matrix_view_hip.cuh [ok] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_2.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_2.cuh [skipped, no changes] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_3.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_3.cuh [skipped, no changes] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_4.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_4.cuh [skipped, no changes] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_8.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_8.cuh [skipped, no changes] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/q_gemm.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/q_gemm.hip [ok] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/compressed_tensors/int8_quant_kernels.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/compressed_tensors/int8_quant_kernels.hip [ok] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/common.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/common.hip [ok] 
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/cuda_utils_kernels.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/hip_utils_kernels.hip [ok] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/moe/align_block_size_kernel.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/moe/align_block_size_kernel.hip [ok] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/prepare_inputs/advance_step.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/prepare_inputs/advance_step_hip.cuh [ok] /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/prepare_inputs/advance_step.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/prepare_inputs/advance_step.hip [ok] Successfully preprocessed all matching files. Total number of unsupported CUDA function calls: 0
Total number of replaced kernel launches: 37
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/cache_kernels.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/attention_kernels.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/pos_encoding_kernels.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/activation_kernels.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/layernorm_kernels.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/squeezellm/quant_hip_kernel.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/q_gemm.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/compressed_tensors/int8_quant_kernels.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/common.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/hip_utils_kernels.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/moe/align_block_size_kernel.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/prepare_inputs/advance_step.hip
[  6%] Built target hipify_C
[ 13%] Building HIP object CMakeFiles/_C.dir/kernels/cache_kernels.hip.o
[ 20%] Building HIP object CMakeFiles/_C.dir/kernels/attention/attention_kernels.hip.o
[ 26%] Building HIP object CMakeFiles/_C.dir/kernels/pos_encoding_kernels.hip.o
[ 33%] Building HIP object CMakeFiles/_C.dir/kernels/activation_kernels.hip.o
[ 40%] Building HIP object CMakeFiles/_C.dir/kernels/layernorm_kernels.hip.o
[ 46%] Building HIP object CMakeFiles/_C.dir/kernels/quantization/squeezellm/quant_hip_kernel.hip.o
[ 53%] Building HIP object CMakeFiles/_C.dir/kernels/quantization/gptq/q_gemm.hip.o
[ 60%] Building HIP object CMakeFiles/_C.dir/kernels/quantization/compressed_tensors/int8_quant_kernels.hip.o
[ 66%] Building HIP object CMakeFiles/_C.dir/kernels/quantization/fp8/common.hip.o
[ 73%] Building HIP object CMakeFiles/_C.dir/kernels/hip_utils_kernels.hip.o
[ 80%] Building HIP object CMakeFiles/_C.dir/kernels/moe/align_block_size_kernel.hip.o
[ 86%] Building CXX object CMakeFiles/_C.dir/kernels/torch_bindings.cpp.o
[ 93%] Building HIP object CMakeFiles/_C.dir/kernels/prepare_inputs/advance_step.hip.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/hip_utils_kernels.hip:9:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
    9 |     hipGetDevice(&device);
      |     ^~~~~~~~~~~~ ~~~~~~~
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/hip_utils_kernels.hip:13:3: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
   13 |   hipDeviceGetAttribute(&value, static_cast<hipDeviceAttribute_t>(attribute),
      |   ^~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   14 |                         device);
      |                         ~~~~~~
2 warnings generated when compiling for gfx906.
In file included from /root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:3,
                 from /root/aphrodite-engine-new/kernels/flash_attn/flash_api.h:10,
                 from /root/aphrodite-engine-new/kernels/torch_bindings.cpp:6:
/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContextLight.h:6:10: fatal error: cuda_runtime_api.h: No such file or directory
    6 | #include <cuda_runtime_api.h>
      |          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/hip_utils_kernels.hip:9:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
    9 |     hipGetDevice(&device);
      |     ^~~~~~~~~~~~ ~~~~~~~
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/hip_utils_kernels.hip:13:3: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
   13 |   hipDeviceGetAttribute(&value, static_cast<hipDeviceAttribute_t>(attribute),
      |   ^~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   14 |                         device);
      |                         ~~~~~~
gmake[3]: *** [CMakeFiles/_C.dir/build.make:235: CMakeFiles/_C.dir/kernels/torch_bindings.cpp.o] Error 1
gmake[3]: *** Waiting for unfinished jobs....
2 warnings generated when compiling for host.
^Cinterrupted
gmake[3]: *** [CMakeFiles/_C.dir/build.make:78: CMakeFiles/_C.dir/kernels/cache_kernels.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:104: CMakeFiles/_C.dir/kernels/pos_encoding_kernels.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:91: CMakeFiles/_C.dir/kernels/attention/attention_kernels.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:117: CMakeFiles/_C.dir/kernels/activation_kernels.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:169: CMakeFiles/_C.dir/kernels/quantization/compressed_tensors/int8_quant_kernels.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:208: CMakeFiles/_C.dir/kernels/moe/align_block_size_kernel.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:221: CMakeFiles/_C.dir/kernels/prepare_inputs/advance_step.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:130: CMakeFiles/_C.dir/kernels/layernorm_kernels.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:143: CMakeFiles/_C.dir/kernels/quantization/squeezellm/quant_hip_kernel.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:156: CMakeFiles/_C.dir/kernels/quantization/gptq/q_gemm.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:182: CMakeFiles/_C.dir/kernels/quantization/fp8/common.hip.o] Interrupt
gmake[2]: *** [CMakeFiles/Makefile2:202: CMakeFiles/_C.dir/all] Interrupt
gmake[1]: *** [CMakeFiles/Makefile2:209: CMakeFiles/_C.dir/rule] Interrupt
gmake: *** [Makefile:208: _C] Interrupt
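The failing include comes from kernels/flash_attn/flash_api.h pulling in the CUDA-only ATen header, so as a sanity check it is worth confirming that the PyTorch in this env is the ROCm build. This uses only standard PyTorch attributes, nothing Aphrodite-specific, and is a diagnostic sketch rather than a fix:

python3 -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"
# a ROCm build of PyTorch reports a HIP version for torch.version.hip instead of None,
# and torch.cuda.is_available() should be True when the gfx906 card is visible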
How did you install Aphrodite?
pip install aphrodite-engine
v0.6.5 builds and installs successfully, but running the QwQ-32B AWQ and GGUF models fails.
(aphroditenew) root@epyc:~# aphrodite run ~/windows/modelscope/QwQ-32B-AWQ/ --max-model-len 8192 --served-model-name aphrodite
INFO: The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
WARNING: awq_marlin kernels are temporarily disabled, they will be re-enabled with a future release. Falling back to AWQ kernels.
WARNING: awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING: Using AWQ quantization with ROCm, but APHRODITE_USE_TRITON_AWQ is not set, enabling APHRODITE_USE_TRITON_AWQ.
INFO: Multiprocessing frontend to use ipc:///tmp/f3118921-e972-466c-a18e-ae9e85ae2aa4 for RPC Path.
INFO: Started engine process with PID 1714518
INFO: The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
WARNING: awq_marlin kernels are temporarily disabled, they will be re-enabled with a future release. Falling back to AWQ kernels.
WARNING: awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING: Using AWQ quantization with ROCm, but APHRODITE_USE_TRITON_AWQ is not set, enabling APHRODITE_USE_TRITON_AWQ.
INFO: Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO: -------------------------------------------------------------------------------------
INFO: Initializing Aphrodite Engine (v0.6.5 commit cbd51a20) with the following config:
INFO: Model = '/root/windows/modelscope/QwQ-32B-AWQ/'
INFO: DataType = torch.float16
INFO: Tensor Parallel Size = 1
INFO: Pipeline Parallel Size = 1
INFO: Disable Custom All-Reduce = True
INFO: Quantization Format = 'awq'
INFO: Context Length = 8192
INFO: Enforce Eager Mode = False
INFO: Prefix Caching = False
INFO: Device = device(type='cuda')
INFO: Guided Decoding Backend = DecodingConfig(guided_decoding_backend='lm-format-enforcer')
INFO: Scheduler Steps = 1
INFO: Async Output Processing = True
INFO: -------------------------------------------------------------------------------------
INFO: Using ROCmFlashAttention backend.
[W405 19:46:07.499544169 socket.cpp:759] [c10d] The client socket cannot be initialized to connect to [::ffff:192.168.2.243]:34505 (errno: 97 - Address family not supported by protocol).
INFO: Loading model /root/windows/modelscope/QwQ-32B-AWQ/...
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
INFO: Using ROCmFlashAttention backend.
⠇ Loading model weights... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 100% 18.00/18.00 GiB 0:00:11
INFO: Model weights loaded in 12.36 seconds.
INFO: Total model weights memory usage: 18.15 GiB
INFO: Profiling peak memory usage...
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in
Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
(the same warning is repeated many more times in the log)
/root/aphrodite-engine-new/aphrodite/quantization/awq.py:163: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:296.)
out = torch.matmul(reshaped_x, out)
ERROR: Error in calling custom op top_k_renorm_prob: '_OpNamespace' '_C' object has no attribute 'top_k_renorm_prob'
ERROR: Possibly you have built or installed an obsolete version of aphrodite.
ERROR: Please try a clean build and install of aphrodite, or remove old built files such as aphrodite/*.so and build/*.
ERROR: Error in calling custom op top_k_top_p_sampling_from_probs: '_OpNamespace' '_C' object has no attribute 'top_k_renorm_prob'
ERROR: Possibly you have built or installed an obsolete version of aphrodite.
ERROR: Please try a clean build and install of aphrodite, or remove old built files such as aphrodite/*.so and build/*.
Process SpawnProcess-1:
Traceback (most recent call last):
File "/root/miniconda3/envs/aphroditenew/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/root/miniconda3/envs/aphroditenew/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/aphrodite-engine-new/aphrodite/endpoints/openai/rpc/server.py", line 229, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, rpc_path)
File "/root/aphrodite-engine-new/aphrodite/endpoints/openai/rpc/server.py", line 39, in init
self.engine = AsyncAphrodite.from_engine_args(async_engine_args)
File "/root/aphrodite-engine-new/aphrodite/engine/async_aphrodite.py", line 741, in from_engine_args
engine = cls(
File "/root/aphrodite-engine-new/aphrodite/engine/async_aphrodite.py", line 630, in init
self.engine = self._init_engine(*args, **kwargs)
File "/root/aphrodite-engine-new/aphrodite/engine/async_aphrodite.py", line 840, in _init_engine
return engine_class(*args, **kwargs)
File "/root/aphrodite-engine-new/aphrodite/engine/async_aphrodite.py", line 263, in init
super().init(*args, **kwargs)
File "/root/aphrodite-engine-new/aphrodite/engine/aphrodite_engine.py", line 307, in init
self._initialize_kv_caches()
File "/root/aphrodite-engine-new/aphrodite/engine/aphrodite_engine.py", line 399, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
File "/root/aphrodite-engine-new/aphrodite/executor/gpu_executor.py", line 111, in determine_num_available_blocks
return self.driver_worker.determine_num_available_blocks()
File "/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/aphrodite-engine-new/aphrodite/task_handler/worker.py", line 199, in determine_num_available_blocks
self.model_runner.profile_run()
File "/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/aphrodite-engine-new/aphrodite/task_handler/model_runner.py", line 1180, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/aphrodite-engine-new/aphrodite/task_handler/model_runner.py", line 1522, in execute_model
output: SamplerOutput = self.model.sample(
File "/root/aphrodite-engine-new/aphrodite/modeling/models/qwen2.py", line 387, in sample
next_tokens = self.sampler(logits, sampling_metadata)
File "/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _call_impl
return forward_call(*args, **kwargs)
File "/root/aphrodite-engine-new/aphrodite/modeling/layers/sampler.py", line 552, in forward
maybe_deferred_sample_results, maybe_sampled_tokens_tensor = _sample(
File "/root/aphrodite-engine-new/aphrodite/modeling/layers/sampler.py", line 1610, in _sample
return _sample_with_torch(
File "/root/aphrodite-engine-new/aphrodite/modeling/layers/sampler.py", line 1469, in _sample_with_torch
sampling_type] = _top_k_top_p_multinomial_with_kernels(
File "/root/aphrodite-engine-new/aphrodite/modeling/layers/sampler.py", line 1324, in _top_k_top_p_multinomial_with_kernels
batch_next_token_ids, success = ops.top_k_top_p_sampling_from_probs(
File "/root/aphrodite-engine-new/aphrodite/_custom_ops.py", line 38, in wrapper
raise e
File "/root/aphrodite-engine-new/aphrodite/_custom_ops.py", line 29, in wrapper
return fn(*args, **kwargs)
File "/root/aphrodite-engine-new/aphrodite/_custom_ops.py", line 788, in top_k_top_p_sampling_from_probs
renorm_probs = top_k_renorm_prob(probs, top_k)
File "/root/aphrodite-engine-new/aphrodite/_custom_ops.py", line 38, in wrapper
raise e
File "/root/aphrodite-engine-new/aphrodite/_custom_ops.py", line 29, in wrapper
return fn(*args, **kwargs)
File "/root/aphrodite-engine-new/aphrodite/_custom_ops.py", line 751, in top_k_renorm_prob
return torch.ops._C.top_k_renorm_prob(probs,
File "/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/_ops.py", line 1232, in getattr
raise AttributeError(
AttributeError: '_OpNamespace' '_C' object has no attribute 'top_k_renorm_prob'
[rank0]:[W405 19:49:10.062242905 ProcessGroupNCCL.cpp:1427] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
ERROR: RPCServer process died before responding to readiness probe
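As an aside, the SWA warning that fills the top of this log names its own workaround: switch to CK flash attention by exporting the variable it mentions before launching. A minimal sketch, assuming a bash-style shell and that the server is otherwise started the same way as before:

export APHRODITE_USE_TRITON_FLASH_ATTN=0
# then start the server as usual; this only switches the ROCm
# flash-attention backend used for half-precision sliding window attention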
You need to set APHRODITE_TARGET_DEVICE="rocm" and also make this change in your CMakeLists: https://github.com/aphrodite-engine/aphrodite-engine/pull/1387/files
Thanks for your help, I will give it a try later.
Never mind, don't change the CMakeLists; just check out that branch and build it with APHRODITE_TARGET_DEVICE="rocm".
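For reference, a sketch of that suggestion (the origin remote and the local branch name pr-1387-rocm are assumptions; the PR number comes from the link above):

cd /root/aphrodite-engine-new
# fetch the PR branch from GitHub and switch to it (local name is a placeholder)
git fetch origin pull/1387/head:pr-1387-rocm
git checkout pr-1387-rocm
# remove artifacts left over from the earlier CUDA-targeted build
rm -rf build/
find aphrodite -maxdepth 1 -name "*.so" -delete
# rebuild, targeting ROCm instead of the default CUDA
APHRODITE_TARGET_DEVICE="rocm" python3 setup.py develop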
0.9.0 works, but 0.9.1 doesn't due to the new vectorized activation kernels being incompatible with ROCm. I will address this soon.
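If a working build is needed in the meantime, pinning the checkout to 0.9.0 and rebuilding the same way should avoid the incompatible kernels (the tag name v0.9.0 is an assumption about how releases are tagged):

git checkout v0.9.0
rm -rf build/
APHRODITE_TARGET_DEVICE="rocm" python3 setup.py develop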