
[Installation]: Installing v0.6.6/v0.6.7 on AMD GPU gfx906 fails; v0.6.5 installs but cannot run GPTQ

Open gengchaogit opened this issue 8 months ago • 6 comments

Your current environment

Hello,

I can install 0.6.2post1 and 0.6.5 with ROCm 6.2.2 successfully on my PC, but I ran into some issues running QwQ-32B-AWQ, so I tried to build the newest v0.6.7 and the build fails: I have ROCm installed, yet the build script tries to find CUDA.
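For reference, a minimal sketch of how a ROCm-targeted source build can be forced. The APHRODITE_TARGET_DEVICE name is an assumption carried over from upstream vLLM's VLLM_TARGET_DEVICE, so verify it against setup.py before relying on it; PYTORCH_ROCM_ARCH and ROCM_HOME are standard PyTorch extension-build overrides:

    # Sketch: force a ROCm-targeted build for gfx906 (not a confirmed fix).
    # ASSUMPTION: APHRODITE_TARGET_DEVICE mirrors vLLM's VLLM_TARGET_DEVICE; check setup.py.
    export APHRODITE_TARGET_DEVICE=rocm
    export PYTORCH_ROCM_ARCH=gfx906    # standard PyTorch/ROCm arch override
    export ROCM_HOME=/opt/rocm-6.2.2   # ROCm install path from the logs below
    rm -rf build                       # drop any stale CMake cache ("Target device: cuda")
    python3 setup.py develop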

2 warnings generated when compiling for gfx906.
In file included from /root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:3,
                 from /root/aphrodite-engine-new/kernels/flash_attn/flash_api.h:10,
                 from /root/aphrodite-engine-new/kernels/torch_bindings.cpp:6:
/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContextLight.h:6:10: fatal error: cuda_runtime_api.h: No such file or directory
    6 | #include <cuda_runtime_api.h>
      |          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
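The include chain in that error can be checked directly in the source tree. A quick sketch, with the line numbers and paths taken from the error output (the `find` result is an assumption about a ROCm-only host with no CUDA toolkit installed):

    # Trace the include chain named in the fatal error above.
    cd /root/aphrodite-engine-new
    sed -n '6p' kernels/torch_bindings.cpp        # should include flash_attn/flash_api.h
    sed -n '10p' kernels/flash_attn/flash_api.h   # should include ATen/cuda/CUDAContext.h
    # CUDAContext.h pulls in the CUDA runtime headers, which a ROCm-only box lacks:
    ls /opt/rocm-6.2.2/include/hip/hip_runtime_api.h        # HIP headers are present
    find /usr /opt -name cuda_runtime_api.h 2>/dev/null     # typically empty without a CUDA toolkit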

(aphroditenew) root@epyc:~/aphrodite-engine-new# python3 setup.py develop
running develop
/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/setuptools/command/develop.py:41: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!

    ********************************************************************************
    Please avoid running ``setup.py`` and ``easy_install``.
    Instead, use pypa/build, pypa/installer or other
    standards-based tools.

    See https://github.com/pypa/setuptools/issues/917 for details.
    ********************************************************************************

!!
  easy_install.initialize_options(self)
/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/setuptools/_distutils/cmd.py:79: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!

    ********************************************************************************
    Please avoid running ``setup.py`` directly.
    Instead, use pypa/build, pypa/installer or other
    standards-based tools.

    See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
    ********************************************************************************

!!
  self.initialize_options()
running egg_info
writing aphrodite_engine.egg-info/PKG-INFO
writing dependency_links to aphrodite_engine.egg-info/dependency_links.txt
writing entry points to aphrodite_engine.egg-info/entry_points.txt
writing requirements to aphrodite_engine.egg-info/requires.txt
writing top-level names to aphrodite_engine.egg-info/top_level.txt
reading manifest file 'aphrodite_engine.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE'
writing manifest file 'aphrodite_engine.egg-info/SOURCES.txt'
running build_ext
Using 64 CPUs as the number of jobs.
-- The CXX compiler identification is GNU 12.3.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Build type: RelWithDebInfo
-- Target device: cuda
-- Found Python: /root/miniconda3/envs/aphroditenew/bin/python3 (found version "3.10.16") found components: Interpreter Development.Module Development.SABIModule
-- Found python matching: /root/miniconda3/envs/aphroditenew/bin/python3.
Building PyTorch for GPU arch: gfx906
-- Found HIP: /opt/rocm-6.2.2 (found suitable version "6.2.41134-65d174c3e", minimum required is "1.0")
HIP VERSION: 6.2.41134-65d174c3e

***** ROCm version from rocm_version.h ****

ROCM_VERSION_DEV: 6.2.2
ROCM_VERSION_DEV_MAJOR: 6
ROCM_VERSION_DEV_MINOR: 2
ROCM_VERSION_DEV_PATCH: 2
ROCM_VERSION_DEV_INT: 60202
HIP_VERSION_MAJOR: 6
HIP_VERSION_MINOR: 2
TORCH_HIP_VERSION: 602

***** Library versions from cmake find_package *****

-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
hip VERSION: 6.2.41134
hsa-runtime64 VERSION: 1.14.60202
amd_comgr VERSION: 2.8.0
rocrand VERSION: 3.1.0
hiprand VERSION: 2.11.0
rocblas VERSION: 4.2.1
hipblas VERSION: 2.2.0
hipblaslt VERSION: 0.8.0
miopen VERSION: 3.2.0
hipfft VERSION: 1.0.15
hipsparse VERSION: 3.1.1
rccl VERSION: 2.20.5
rocprim VERSION: 3.2.0
hipcub VERSION: 3.2.0
rocthrust VERSION: 3.1.0
hipsolver VERSION: 2.2.0
CMake Deprecation Warning at /opt/rocm/lib/cmake/hiprtc/hiprtc-config.cmake:21 (cmake_minimum_required):
  Compatibility with CMake < 3.10 will be removed from a future version of CMake.

  Update the VERSION argument <min> value. Or, use the <min>...<max> syntax
  to tell CMake that the project requires at least <min> but has been updated
  to work with policies introduced by <max> or earlier.
Call Stack (most recent call first):
  /root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/share/cmake/Caffe2/public/LoadHIP.cmake:56 (find_package)
  /root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/share/cmake/Caffe2/public/LoadHIP.cmake:131 (find_package_and_print_version)
  /root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:74 (include)
  /root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:70 (find_package)

hiprtc VERSION: 6.2.41134
HIP is using new type enums
CMake Warning at /root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
  static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
  /root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:121 (append_torchlib_if_found)
  CMakeLists.txt:70 (find_package)

-- Found Torch: /root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/lib/libtorch.so
-- Enabling core extension.
-- The HIP compiler identification is Clang 18.0.0
-- Detecting HIP compiler ABI info
-- Detecting HIP compiler ABI info - done
-- Check for working HIP compiler: /opt/rocm-6.2.2/lib/llvm/bin/clang++ - skipped
-- Detecting HIP compile features
-- Detecting HIP compile features - done
CMake Warning at CMakeLists.txt:154 (message):
  Pytorch version >= 2.5.0 expected for ROCm build, saw 2.6.0 instead.

-- HIP supported arches: gfx906;gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1030;gfx1100;gfx1101
-- HIP target arches: gfx906;gfx906
-- Enabling C extension.
-- Enabling moe extension.
-- Enabling rocm extension.
-- Configuring done (8.9s)
-- Generating done (0.0s)
-- Build files have been written to: /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310
Using 64 CPUs as the number of jobs.
[ 50%] Building CXX object CMakeFiles/_core_C.dir/kernels/core/torch_bindings.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[100%] Linking CXX shared module /root/aphrodite-engine-new/build/lib.linux-x86_64-cpython-310/aphrodite/_core_C.abi3.so
[100%] Built target _core_C
[ 25%] Running hipify on _moe_C extension source files.
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/cuda_compat.h -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/hip_compat.h [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/moe/softmax.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/moe/softmax.hip [ok]
Successfully preprocessed all matching files.
Total number of unsupported CUDA function calls: 0

Total number of replaced kernel launches: 3
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/moe/softmax.hip
[ 25%] Built target hipify_moe_C
[ 50%] Building HIP object CMakeFiles/_moe_C.dir/kernels/moe/softmax.hip.o
[ 75%] Building CXX object CMakeFiles/_moe_C.dir/kernels/moe/torch_bindings.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[100%] Linking HIP shared module /root/aphrodite-engine-new/build/lib.linux-x86_64-cpython-310/aphrodite/_moe_C.abi3.so
[100%] Built target _moe_C
[ 6%] Running hipify on _C extension source files.
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/cuda_compat.h -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/hip_compat.h [skipped, already hipified]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/dispatch_utils.h -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/dispatch_utils.h [skipped, no changes]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/amd/hip_float8_impl.h -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/amd/hip_float8_impl.h [skipped, no changes]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/amd/hip_float8.h -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/amd/hip_float8.h [skipped, no changes]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/attention_generic.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/attention_generic.cuh [skipped, no changes]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/dtype_fp8.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/dtype_fp8.cuh [skipped, no changes]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/dtype_float32.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/dtype_float32.cuh [skipped, no changes]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/dtype_bfloat16.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/dtype_bfloat16_hip.cuh [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/amd/quant_utils.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/amd/quant_utils_hip.cuh [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/dtype_float16.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/dtype_float16.cuh [skipped, no changes]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/attention_dtypes.h -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/attention_dtypes_hip.h [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/nvidia/quant_utils.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/nvidia/quant_utils_hip.cuh [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/cache_kernels.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/cache_kernels.hip [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/attention_utils.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/attention_utils_hip.cuh [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/attention_kernels.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/attention_kernels.hip [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/pos_encoding_kernels.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/pos_encoding_kernels.hip [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/activation_kernels.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/activation_kernels.hip [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/layernorm_kernels.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/layernorm_kernels.hip [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/squeezellm/quant_cuda_kernel.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/squeezellm/quant_hip_kernel.hip [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/compat.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/compat.cuh [skipped, no changes]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_util.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_util.cuh [skipped, no changes]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/matrix_view.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/matrix_view_hip.cuh [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_2.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_2.cuh [skipped, no changes]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_3.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_3.cuh [skipped, no changes]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_4.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_4.cuh [skipped, no changes]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_8.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/qdq_8.cuh [skipped, no changes]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/q_gemm.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/q_gemm.hip [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/compressed_tensors/int8_quant_kernels.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/compressed_tensors/int8_quant_kernels.hip [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/common.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/common.hip [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/cuda_utils_kernels.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/hip_utils_kernels.hip [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/moe/align_block_size_kernel.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/moe/align_block_size_kernel.hip [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/prepare_inputs/advance_step.cuh -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/prepare_inputs/advance_step_hip.cuh [ok]
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/prepare_inputs/advance_step.cu -> /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/prepare_inputs/advance_step.hip [ok]
Successfully preprocessed all matching files.
Total number of unsupported CUDA function calls: 0

Total number of replaced kernel launches: 37
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/cache_kernels.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/attention/attention_kernels.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/pos_encoding_kernels.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/activation_kernels.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/layernorm_kernels.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/squeezellm/quant_hip_kernel.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/gptq/q_gemm.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/compressed_tensors/int8_quant_kernels.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/quantization/fp8/common.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/hip_utils_kernels.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/moe/align_block_size_kernel.hip
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/prepare_inputs/advance_step.hip
[ 6%] Built target hipify_C
[ 13%] Building HIP object CMakeFiles/_C.dir/kernels/cache_kernels.hip.o
[ 20%] Building HIP object CMakeFiles/_C.dir/kernels/attention/attention_kernels.hip.o
[ 26%] Building HIP object CMakeFiles/_C.dir/kernels/pos_encoding_kernels.hip.o
[ 33%] Building HIP object CMakeFiles/_C.dir/kernels/activation_kernels.hip.o
[ 40%] Building HIP object CMakeFiles/_C.dir/kernels/layernorm_kernels.hip.o
[ 46%] Building HIP object CMakeFiles/_C.dir/kernels/quantization/squeezellm/quant_hip_kernel.hip.o
[ 53%] Building HIP object CMakeFiles/_C.dir/kernels/quantization/gptq/q_gemm.hip.o
[ 60%] Building HIP object CMakeFiles/_C.dir/kernels/quantization/compressed_tensors/int8_quant_kernels.hip.o
[ 66%] Building HIP object CMakeFiles/_C.dir/kernels/quantization/fp8/common.hip.o
[ 73%] Building HIP object CMakeFiles/_C.dir/kernels/hip_utils_kernels.hip.o
[ 80%] Building HIP object CMakeFiles/_C.dir/kernels/moe/align_block_size_kernel.hip.o
[ 86%] Building CXX object CMakeFiles/_C.dir/kernels/torch_bindings.cpp.o
[ 93%] Building HIP object CMakeFiles/_C.dir/kernels/prepare_inputs/advance_step.hip.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/hip_utils_kernels.hip:9:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
    9 |     hipGetDevice(&device);
      |     ^~~~~~~~~~~~ ~~~~~~~
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/hip_utils_kernels.hip:13:3: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
   13 |   hipDeviceGetAttribute(&value, static_cast<hipDeviceAttribute_t>(attribute),
      |   ^~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   14 |                         device);
      |                         ~~~~~~
2 warnings generated when compiling for gfx906.
In file included from /root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContext.h:3,
                 from /root/aphrodite-engine-new/kernels/flash_attn/flash_api.h:10,
                 from /root/aphrodite-engine-new/kernels/torch_bindings.cpp:6:
/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/include/ATen/cuda/CUDAContextLight.h:6:10: fatal error: cuda_runtime_api.h: No such file or directory
    6 | #include <cuda_runtime_api.h>
      |          ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/hip_utils_kernels.hip:9:5: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
    9 |     hipGetDevice(&device);
      |     ^~~~~~~~~~~~ ~~~~~~~
/root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310/kernels/hip_utils_kernels.hip:13:3: warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]
   13 |   hipDeviceGetAttribute(&value, static_cast<hipDeviceAttribute_t>(attribute),
      |   ^~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   14 |                         device);
      |                         ~~~~~~
gmake[3]: *** [CMakeFiles/_C.dir/build.make:235: CMakeFiles/_C.dir/kernels/torch_bindings.cpp.o] Error 1
gmake[3]: *** Waiting for unfinished jobs....
2 warnings generated when compiling for host.
^Cinterrupted
gmake[3]: *** [CMakeFiles/_C.dir/build.make:78: CMakeFiles/_C.dir/kernels/cache_kernels.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:104: CMakeFiles/_C.dir/kernels/pos_encoding_kernels.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:91: CMakeFiles/_C.dir/kernels/attention/attention_kernels.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:117: CMakeFiles/_C.dir/kernels/activation_kernels.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:169: CMakeFiles/_C.dir/kernels/quantization/compressed_tensors/int8_quant_kernels.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:208: CMakeFiles/_C.dir/kernels/moe/align_block_size_kernel.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:221: CMakeFiles/_C.dir/kernels/prepare_inputs/advance_step.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:130: CMakeFiles/_C.dir/kernels/layernorm_kernels.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:143: CMakeFiles/_C.dir/kernels/quantization/squeezellm/quant_hip_kernel.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:156: CMakeFiles/_C.dir/kernels/quantization/gptq/q_gemm.hip.o] Interrupt
gmake[3]: *** [CMakeFiles/_C.dir/build.make:182: CMakeFiles/_C.dir/kernels/quantization/fp8/common.hip.o] Interrupt
gmake[2]: *** [CMakeFiles/Makefile2:202: CMakeFiles/_C.dir/all] Interrupt
gmake[1]: *** [CMakeFiles/Makefile2:209: CMakeFiles/_C.dir/rule] Interrupt
gmake: *** [Makefile:208: _C] Interrupt
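To iterate on a fix without rerunning the whole build, the failing extension can be rebuilt on its own; the `_C` target name is taken from the `Makefile:208: _C` line in the error output above, and the build directory from the CMake output:

    # Rebuild only the failing extension target, single-threaded and verbose.
    cd /root/aphrodite-engine-new/build/temp.linux-x86_64-cpython-310
    make _C VERBOSE=1 -j1 2>&1 | tee /tmp/aphrodite_C_build.log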

How did you install Aphrodite?

pip install aphrodite-engine

gengchaogit · Apr 05 '25 19:04

v0.6.5 builds and installs successfully, but running the QwQ-32B model (AWQ and GGUF) fails.

gengchaogit · Apr 05 '25 19:04

(aphroditenew) root@epyc:~# aphrodite run ~/windows/modelscope/QwQ-32B-AWQ/ --max-model-len 8192 --served-model-name aphrodite
INFO: The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
WARNING: awq_marlin kernels are temporarily disabled, they will be re-enabled with a future release. Falling back to AWQ kernels.
WARNING: awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING: Using AWQ quantization with ROCm, but APHRODITE_USE_TRITON_AWQ is not set, enabling APHRODITE_USE_TRITON_AWQ.
INFO: Multiprocessing frontend to use ipc:///tmp/f3118921-e972-466c-a18e-ae9e85ae2aa4 for RPC Path.
INFO: Started engine process with PID 1714518
INFO: The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
WARNING: awq_marlin kernels are temporarily disabled, they will be re-enabled with a future release. Falling back to AWQ kernels.
WARNING: awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING: Using AWQ quantization with ROCm, but APHRODITE_USE_TRITON_AWQ is not set, enabling APHRODITE_USE_TRITON_AWQ.
INFO: Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO: -------------------------------------------------------------------------------------
INFO: Initializing Aphrodite Engine (v0.6.5 commit cbd51a20) with the following config:
INFO: Model = '/root/windows/modelscope/QwQ-32B-AWQ/'
INFO: DataType = torch.float16
INFO: Tensor Parallel Size = 1
INFO: Pipeline Parallel Size = 1
INFO: Disable Custom All-Reduce = True
INFO: Quantization Format = 'awq'
INFO: Context Length = 8192
INFO: Enforce Eager Mode = False
INFO: Prefix Caching = False
INFO: Device = device(type='cuda')
INFO: Guided Decoding Backend = DecodingConfig(guided_decoding_backend='lm-format-enforcer')
INFO: Scheduler Steps = 1
INFO: Async Output Processing = True
INFO: -------------------------------------------------------------------------------------
INFO: Using ROCmFlashAttention backend.
[W405 19:46:07.499544169 socket.cpp:759] [c10d] The client socket cannot be initialized to connect to [::ffff:192.168.2.243]:34505 (errno: 97 - Address family not supported by protocol).
INFO: Loading model /root/windows/modelscope/QwQ-32B-AWQ/...
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
INFO: Using ROCmFlashAttention backend.
Loading model weights... 100% 18.00/18.00 GiB 0:00:11
INFO: Model weights loaded in 12.36 seconds.
INFO: Total model weights memory usage: 18.15 GiB
INFO: Profiling peak memory usage...
WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0
[the SWA warning above repeats many more times during profiling]
For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. 
For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. 
For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. 
For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. 
For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. 
For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. 
For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. For half-precision SWA support, please use CK flash attention by setting APHRODITE_USE_TRITON_FLASH_ATTN=0 WARNING: Model architecture Qwen2ForCausalLM is partially supported by ROCm: Sliding window attention (SWA) is not yet supported in Triton flash attention. 
/root/aphrodite-engine-new/aphrodite/quantization/awq.py:163: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:296.)
  out = torch.matmul(reshaped_x, out)
ERROR: Error in calling custom op top_k_renorm_prob: '_OpNamespace' '_C' object has no attribute 'top_k_renorm_prob'
ERROR: Possibly you have built or installed an obsolete version of aphrodite.
ERROR: Please try a clean build and install of aphrodite, or remove old built files such as aphrodite/*.so and build/ .
ERROR: Error in calling custom op top_k_top_p_sampling_from_probs: '_OpNamespace' '_C' object has no attribute 'top_k_renorm_prob'
ERROR: Possibly you have built or installed an obsolete version of aphrodite.
ERROR: Please try a clean build and install of aphrodite, or remove old built files such as aphrodite/*.so and build/ .
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/root/miniconda3/envs/aphroditenew/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/miniconda3/envs/aphroditenew/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/aphrodite-engine-new/aphrodite/endpoints/openai/rpc/server.py", line 229, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, rpc_path)
  File "/root/aphrodite-engine-new/aphrodite/endpoints/openai/rpc/server.py", line 39, in __init__
    self.engine = AsyncAphrodite.from_engine_args(async_engine_args)
  File "/root/aphrodite-engine-new/aphrodite/engine/async_aphrodite.py", line 741, in from_engine_args
    engine = cls(
  File "/root/aphrodite-engine-new/aphrodite/engine/async_aphrodite.py", line 630, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/root/aphrodite-engine-new/aphrodite/engine/async_aphrodite.py", line 840, in _init_engine
    return engine_class(*args, **kwargs)
  File "/root/aphrodite-engine-new/aphrodite/engine/async_aphrodite.py", line 263, in __init__
    super().__init__(*args, **kwargs)
  File "/root/aphrodite-engine-new/aphrodite/engine/aphrodite_engine.py", line 307, in __init__
    self._initialize_kv_caches()
  File "/root/aphrodite-engine-new/aphrodite/engine/aphrodite_engine.py", line 399, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/root/aphrodite-engine-new/aphrodite/executor/gpu_executor.py", line 111, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
  File "/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/aphrodite-engine-new/aphrodite/task_handler/worker.py", line 199, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/aphrodite-engine-new/aphrodite/task_handler/model_runner.py", line 1180, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/root/aphrodite-engine-new/aphrodite/task_handler/model_runner.py", line 1522, in execute_model
    output: SamplerOutput = self.model.sample(
  File "/root/aphrodite-engine-new/aphrodite/modeling/models/qwen2.py", line 387, in sample
    next_tokens = self.sampler(logits, sampling_metadata)
  File "/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1740, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/aphrodite-engine-new/aphrodite/modeling/layers/sampler.py", line 552, in forward
    maybe_deferred_sample_results, maybe_sampled_tokens_tensor = _sample(
  File "/root/aphrodite-engine-new/aphrodite/modeling/layers/sampler.py", line 1610, in _sample
    return _sample_with_torch(
  File "/root/aphrodite-engine-new/aphrodite/modeling/layers/sampler.py", line 1469, in _sample_with_torch
    sampling_type] = _top_k_top_p_multinomial_with_kernels(
  File "/root/aphrodite-engine-new/aphrodite/modeling/layers/sampler.py", line 1324, in _top_k_top_p_multinomial_with_kernels
    batch_next_token_ids, success = ops.top_k_top_p_sampling_from_probs(
  File "/root/aphrodite-engine-new/aphrodite/_custom_ops.py", line 38, in wrapper
    raise e
  File "/root/aphrodite-engine-new/aphrodite/_custom_ops.py", line 29, in wrapper
    return fn(*args, **kwargs)
  File "/root/aphrodite-engine-new/aphrodite/_custom_ops.py", line 788, in top_k_top_p_sampling_from_probs
    renorm_probs = top_k_renorm_prob(probs, top_k)
  File "/root/aphrodite-engine-new/aphrodite/_custom_ops.py", line 38, in wrapper
    raise e
  File "/root/aphrodite-engine-new/aphrodite/_custom_ops.py", line 29, in wrapper
    return fn(*args, **kwargs)
  File "/root/aphrodite-engine-new/aphrodite/_custom_ops.py", line 751, in top_k_renorm_prob
    return torch.ops._C.top_k_renorm_prob(probs,
  File "/root/miniconda3/envs/aphroditenew/lib/python3.10/site-packages/torch/_ops.py", line 1232, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' '_C' object has no attribute 'top_k_renorm_prob'
[rank0]:[W405 19:49:10.062242905 ProcessGroupNCCL.cpp:1427] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
ERROR: RPCServer process died before responding to readiness probe
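The ERROR lines in the log above already suggest one thing to try: stale compiled extensions from an earlier build being picked up at import time. A minimal clean-rebuild sketch, assuming the checkout path shown in the traceback (paths are from this log, not a general recipe):

```bash
# Remove old CMake output and any previously built extension modules,
# then rebuild in develop mode as before.
cd /root/aphrodite-engine-new
rm -rf build/
find aphrodite/ -name '*.so' -delete
python3 setup.py develop
```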

gengchaogit avatar Apr 05 '25 19:04 gengchaogit

You need to set APHRODITE_TARGET_DEVICE="rocm". Also make this change in your CMakeLists: https://github.com/aphrodite-engine/aphrodite-engine/pull/1387/files
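For example, a minimal sketch of the ROCm build (same develop-mode command used earlier in this thread, with the target device overridden):

```bash
# Force the build to target ROCm instead of the default CUDA path.
export APHRODITE_TARGET_DEVICE="rocm"
python3 setup.py develop
```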

ewof avatar May 03 '25 21:05 ewof

> You need to set APHRODITE_TARGET_DEVICE="rocm". Also make this change in your CMakeLists: https://github.com/aphrodite-engine/aphrodite-engine/pull/1387/files

Thanks for your help, I will give it a try later.

gengchaogit avatar May 03 '25 21:05 gengchaogit

Never mind, don't change CMakeLists. Just check out that branch and build it with APHRODITE_TARGET_DEVICE="rocm".
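For instance (the local branch name pr-1387 is arbitrary; 1387 is just the PR linked above):

```bash
# Fetch the PR branch from GitHub, check it out, and build for ROCm.
git fetch origin pull/1387/head:pr-1387
git checkout pr-1387
APHRODITE_TARGET_DEVICE="rocm" python3 setup.py develop
```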

ewof avatar May 04 '25 05:05 ewof

0.9.0 works, but 0.9.1 doesn't due to the new vectorized activation kernels being incompatible with ROCm. I will address this soon.

AlpinDale avatar Sep 27 '25 16:09 AlpinDale