mlc-llm
[Bug] KVCache expects 19 arguments, but 18 were provided
bug: 2024-04-02 23:28:42.699 24048-24662 TVM_RUNTIME ai.mlc.mlcchat A /mlc-llm/3rdparty/tvm/include/tvm/runtime/packed_func.h:1908: Function vm.builtin.paged_attention_kv_cache_create_reduced(0: runtime.ShapeTuple, 1: int64_t, 2: int64_t, 3: int64_t, 4: int64_t, 5: int, 6: double, 7: double, 8: runtime.NDArray, 9: runtime.PackedFunc, 10: runtime.PackedFunc, 11: runtime.PackedFunc, 12: runtime.PackedFunc, 13: runtime.PackedFunc, 14: runtime.PackedFunc, 15: runtime.PackedFunc, 16: runtime.PackedFunc, 17: runtime.PackedFunc, 18: runtime.PackedFunc) -> relax.vm.AttentionKVCache expects 19 arguments, but 18 were provided.
python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"

USE_NVTX: OFF
USE_GTEST: AUTO
SUMMARIZE: OFF
TVM_DEBUG_WITH_ABI_CHANGE: OFF
USE_IOS_RPC: OFF
USE_MSC: OFF
USE_ETHOSU:
CUDA_VERSION: NOT-FOUND
USE_LIBBACKTRACE: AUTO
DLPACK_PATH: 3rdparty/dlpack/include
USE_TENSORRT_CODEGEN: OFF
USE_THRUST: OFF
USE_TARGET_ONNX: OFF
USE_AOT_EXECUTOR: ON
BUILD_DUMMY_LIBTVM: OFF
USE_CUDNN: OFF
USE_TENSORRT_RUNTIME: OFF
USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR: OFF
USE_CCACHE: AUTO
USE_ARM_COMPUTE_LIB: OFF
USE_CPP_RTVM:
USE_OPENCL_GTEST: /path/to/opencl/gtest
USE_MKL: OFF
USE_PT_TVMDSOOP: OFF
MLIR_VERSION: NOT-FOUND
USE_CLML: OFF
USE_STACKVM_RUNTIME: OFF
USE_GRAPH_EXECUTOR_CUDA_GRAPH: OFF
ROCM_PATH: /opt/rocm
USE_DNNL: OFF
USE_VITIS_AI: OFF
USE_MLIR: OFF
USE_RCCL: OFF
USE_LLVM: llvm-config --ignore-libllvm --link-static
USE_VERILATOR: OFF
USE_TF_TVMDSOOP: OFF
USE_THREADS: ON
USE_MSVC_MT: OFF
BACKTRACE_ON_SEGFAULT: OFF
USE_GRAPH_EXECUTOR: ON
USE_NCCL: OFF
USE_ROCBLAS: OFF
GIT_COMMIT_HASH: 4bdcf425f762080614a218b229b3e4c38828bca0
USE_VULKAN: ON
USE_RUST_EXT: OFF
USE_CUTLASS: OFF
USE_CPP_RPC: OFF
USE_HEXAGON: OFF
USE_CUSTOM_LOGGING: OFF
USE_UMA: OFF
USE_FALLBACK_STL_MAP: OFF
USE_SORT: ON
USE_RTTI: ON
GIT_COMMIT_TIME: 2024-03-31 23:01:12 -0400
USE_HEXAGON_SDK: /path/to/sdk
USE_BLAS: none
USE_ETHOSN: OFF
USE_LIBTORCH: OFF
USE_RANDOM: ON
USE_CUDA: OFF
USE_COREML: OFF
USE_AMX: OFF
BUILD_STATIC_RUNTIME: OFF
USE_CMSISNN: OFF
USE_KHRONOS_SPIRV: OFF
USE_CLML_GRAPH_EXECUTOR: OFF
USE_TFLITE: OFF
USE_HEXAGON_GTEST: /path/to/hexagon/gtest
PICOJSON_PATH: 3rdparty/picojson
USE_OPENCL_ENABLE_HOST_PTR: OFF
INSTALL_DEV: OFF
USE_PROFILER: ON
USE_NNPACK: OFF
LLVM_VERSION: 15.0.7
USE_MRVL: OFF
USE_OPENCL: OFF
COMPILER_RT_PATH: 3rdparty/compiler-rt
RANG_PATH: 3rdparty/rang/include
USE_SPIRV_KHR_INTEGER_DOT_PRODUCT: OFF
USE_OPENMP: OFF
USE_BNNS: OFF
USE_FLASHINFER:
USE_CUBLAS: OFF
USE_METAL: OFF
USE_MICRO_STANDALONE_RUNTIME: OFF
USE_HEXAGON_EXTERNAL_LIBS: OFF
USE_ALTERNATIVE_LINKER: AUTO
USE_BYODT_POSIT: OFF
USE_HEXAGON_RPC: OFF
USE_MICRO: OFF
DMLC_PATH: 3rdparty/dmlc-core/include
INDEX_DEFAULT_I64: ON
USE_RELAY_DEBUG: OFF
USE_RPC: ON
USE_TENSORFLOW_PATH: none
TVM_CLML_VERSION:
USE_MIOPEN: OFF
USE_ROCM: OFF
USE_PAPI: OFF
USE_CURAND: OFF
TVM_CXX_COMPILER_PATH: /opt/rh/gcc-toolset-11/root/usr/bin/c++
HIDE_PRIVATE_SYMBOLS: ON
Hi @Vinaysukhesh98 thank you for reporting. Could you re-compile the model with python -m mlc_llm compile ...? Also see this thread where the same issue happens. I believe model recompilation can help.
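For reference, a full compile invocation looks roughly like the following; the model directory, device target, and output path are illustrative placeholders, not values taken from this thread:

python -m mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
    --device android \
    -o ./dist/libs/Llama-2-7b-chat-hf-q4f16_1-android.tar

The important part is that the model library (.tar) is rebuilt with the same mlc_llm/TVM version the app's runtime uses, so both sides agree on the KV-cache creation signature.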
Hi, still the same error after recompiling.
I see. To help confirm the issue, could you check the file mlc_llm/nn/kv_cache.py on your local side and see whether line 351 is the following? If it is not, your mlc_llm is not up to date; please update mlc_llm to the latest nightly or the latest commit on the main branch.
bb.add_func(_copy_single_page(num_key_value_heads, page_size, head_dim, dtype, target), "kv_cache_copy_single_page"),
https://github.com/mlc-ai/mlc-llm/blob/5bc3ffa6f682a4cf42fdeba3a4c505d0e7c08c3c/python/mlc_llm/nn/kv_cache.py#L351
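If it helps, the two checks above can be done quickly from a shell (a small sketch; it assumes an editable install from the repo checkout, as reported later in this thread). The error in the first comment means the compiled model library passes 18 arguments while the installed runtime expects 19, so the question is whether the compile side is up to date:

sed -n '351p' python/mlc_llm/nn/kv_cache.py   # should print the bb.add_func(..., "kv_cache_copy_single_page") line above
python -m pip show mlc_llm                    # confirms which mlc_llm build is actually installed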
mlc_llm version:

pip show mlc_llm
Name: mlc-llm
Version: 0.1.dev1068+gb7416c02
Summary: MLC LLM: an universal LLM deployment engine via ML compilation.
Home-page: https://llm.mlc.ai/
Author: MLC LLM Contributors
Author-email:
License: Apache 2.0
Location: /home/mbuhyd/Documents/mlc-llm/python
Editable project location: /home/mbuhyd/Documents/mlc-llm/python
Requires: fastapi, openai, prompt_toolkit, requests, safetensors, shortuuid, tiktoken, torch, tqdm, uvicorn
Required-by:
Fwiw, I am getting the same error with the following config:
USE_NVTX: OFF
USE_GTEST: AUTO
SUMMARIZE: OFF
TVM_DEBUG_WITH_ABI_CHANGE: OFF
USE_IOS_RPC: OFF
USE_MSC: OFF
USE_ETHOSU:
CUDA_VERSION: NOT-FOUND
USE_LIBBACKTRACE: AUTO
DLPACK_PATH: 3rdparty/dlpack/include
USE_TENSORRT_CODEGEN: OFF
USE_THRUST: OFF
USE_TARGET_ONNX: OFF
USE_AOT_EXECUTOR: ON
BUILD_DUMMY_LIBTVM: OFF
USE_CUDNN: OFF
USE_TENSORRT_RUNTIME: OFF
USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR: OFF
USE_CCACHE: AUTO
USE_ARM_COMPUTE_LIB: OFF
USE_CPP_RTVM:
USE_OPENCL_GTEST: /path/to/opencl/gtest
USE_MKL: OFF
USE_PT_TVMDSOOP: OFF
MLIR_VERSION: NOT-FOUND
USE_CLML: OFF
USE_STACKVM_RUNTIME: OFF
USE_GRAPH_EXECUTOR_CUDA_GRAPH: OFF
ROCM_PATH: /opt/rocm
USE_DNNL: OFF
USE_VITIS_AI: OFF
USE_MLIR: OFF
USE_RCCL: OFF
USE_LLVM: ON
USE_VERILATOR: OFF
USE_TF_TVMDSOOP: OFF
USE_THREADS: ON
USE_MSVC_MT: OFF
BACKTRACE_ON_SEGFAULT: OFF
USE_GRAPH_EXECUTOR: ON
USE_NCCL: OFF
USE_ROCBLAS: OFF
GIT_COMMIT_HASH: 5400532c4ba37e8a30fcaac488c2ecb05a307e4f
USE_VULKAN: OFF
USE_RUST_EXT: OFF
USE_CUTLASS: OFF
USE_CPP_RPC: OFF
USE_HEXAGON: OFF
USE_CUSTOM_LOGGING: OFF
USE_UMA: OFF
USE_FALLBACK_STL_MAP: OFF
USE_SORT: ON
USE_RTTI: ON
GIT_COMMIT_TIME: 2024-03-30 17:34:21 -0400
USE_HEXAGON_SDK: /path/to/sdk
USE_BLAS: none
USE_ETHOSN: OFF
USE_LIBTORCH: OFF
USE_RANDOM: ON
USE_CUDA: OFF
USE_COREML: OFF
USE_AMX: OFF
BUILD_STATIC_RUNTIME: OFF
USE_CMSISNN: OFF
USE_KHRONOS_SPIRV: OFF
USE_CLML_GRAPH_EXECUTOR: OFF
USE_TFLITE: OFF
USE_HEXAGON_GTEST: /path/to/hexagon/gtest
PICOJSON_PATH: 3rdparty/picojson
USE_OPENCL_ENABLE_HOST_PTR: OFF
INSTALL_DEV: OFF
USE_PROFILER: ON
USE_NNPACK: OFF
LLVM_VERSION: 18.1.3
USE_MRVL: OFF
USE_OPENCL: OFF
COMPILER_RT_PATH: 3rdparty/compiler-rt
RANG_PATH: 3rdparty/rang/include
USE_SPIRV_KHR_INTEGER_DOT_PRODUCT: OFF
USE_OPENMP: OFF
USE_BNNS: OFF
USE_FLASHINFER:
USE_CUBLAS: OFF
USE_METAL: OFF
USE_MICRO_STANDALONE_RUNTIME: OFF
USE_HEXAGON_EXTERNAL_LIBS: OFF
USE_ALTERNATIVE_LINKER: AUTO
USE_BYODT_POSIT: OFF
USE_HEXAGON_RPC: OFF
USE_MICRO: OFF
DMLC_PATH: 3rdparty/dmlc-core/include
INDEX_DEFAULT_I64: ON
USE_RELAY_DEBUG: OFF
USE_RPC: ON
USE_TENSORFLOW_PATH: none
TVM_CLML_VERSION:
USE_MIOPEN: OFF
USE_ROCM: OFF
USE_PAPI: OFF
USE_CURAND: OFF
TVM_CXX_COMPILER_PATH: /usr/bin/c++
HIDE_PRIVATE_SYMBOLS: OFF
I have just cloned the repo, so everything should be up to date.
@Vinaysukhesh98 @sygi could you share the exact commands you ran, together with the full log? From the information we have so far I am unable to find the cause. Thanks in advance!
1. Clone the repo:
   git clone --recursive https://github.com/mlc-ai/mlc-llm/
2. Add tvm, java, android, etc. to bashrc.
3. Compile tvm:
   chmod +x 3rdparty/libbacktrace/configure && \
   mkdir -p cmake-build && \
   cmake -H. -Bcmake-build -DUSE_LLVM=ON && \
   cmake --build cmake-build --target all -- -j 4 && \
   mv cmake-build build
4. Download configs etc. for android + llama 7B f16 from here.
5. Start virtualenv, install dependencies (attrs, numpy, typing_extensions, psutil, decorator).
6. Go to android/library, add:
   set(JAVA_AWT_LIBRARY NotNeeded)
   set(JAVA_JVM_LIBRARY NotNeeded)
   set(JAVA_INCLUDE_PATH2 NotNeeded)
   set(JAVA_AWT_INCLUDE_PATH NotNeeded)
   to CMakeLists.txt.
7. Modify app-config.json to only include the llama model and point to my android.tar.
8. Run prepare_libs.sh, make sure the libraries appeared.
9. Compile things in Android Studio, send to device.
10. Download the weights, start conversation.
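In case it helps with the request for full logging above: the trace in the first comment is an Android logcat line, so the complete log around the crash can be captured with something like the following (a sketch, assuming adb is on the PATH and the device is connected):

adb logcat -s TVM_RUNTIME   # suppress everything except the TVM_RUNTIME tag seen in the trace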
Hello, same issue encountered when building and testing the Android app as instructed in https://llm.mlc.ai/docs/deploy/android.html. The error message appears as soon as the chat UI initializes.
We use prebuilt models and libs:
- https://huggingface.co/mlc-ai/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
- https://huggingface.co/mlc-ai/phi-2-q4f16_1-MLC
- https://github.com/mlc-ai/binary-mlc-llm-libs
We use the latest prebuilt packages:
- mlc_llm v0.1.dev0
- tvm 0.16.dev0.
We are attempting to fix this with earlier versions. No prebuilt packages are provided for those versions, so we have to build from source.
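As far as I can tell, prebuilt wheels are only published for the current nightlies, which is why pairing an older mlc with an older tvm means building from source. For the current pair, the documented install (adjust the package suffix for CUDA/ROCm platforms) is roughly:

python -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly mlc-ai-nightly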
Update:
Still no luck. We've managed to build the Android .tar with tvm v0.15 (after fixing some version mismatch issues with mlc and llvm...), but it does not seem compatible with the latest prebuilt weights and libs, and we get the error message described in #2031. It looks like the deployment workflow has been undergoing significant changes recently, and we have not been able to find a workaround ourselves so far.
We understand that the Android pipeline might not be the priority, but we would still appreciate it if you could look into this issue and provide a working Android build & deployment pipeline :)
I actually looked into #2031 a bit, @xxxxyu, and it looks like the config format has changed recently. Can you confirm that you have the latest version of mlc-chat-config (which has conv_template as a string), as well as a new version of tvm (which parses the object here)?
@sygi Hi, the mlc-chat-config is exactly the same as you provided, but mlc/tvm version is earlier.
We actually tested the following 3 settings (windows + wsl + Pixel 6 Pro):
- latest mlc + tvm 0.16: the bug reported in this issue.
- mlc commit #1659 + tvm 0.15 (built from source): bug reported by #2031.
- latest mlc + tvm 0.15: mismatch.
I also encountered the same problem, which can be solved by modifying mlc-llm/3rdparty/tvm/src/runtime/relax_vm/paged_kv_cache.cc at line 1814:

TVM_REGISTER_GLOBAL("vm.builtin.paged_attention_kv_cache_create_reduced")
    .set_body_typed([](ShapeTuple cache_config, int64_t num_layers, int64_t num_qo_heads,
                       int64_t num_kv_heads, int64_t head_dim, int rope_mode, double rotary_scale,
                       double rotary_theta, NDArray init, PackedFunc f_transpose_append,
                       PackedFunc f_attention_prefill, PackedFunc f_attention_decode,
                       PackedFunc f_attention_prefill_sliding_window,
                       PackedFunc f_attention_decode_sliding_window,
                       PackedFunc f_attention_prefill_ragged, PackedFunc f_merge_inplace,
                       PackedFunc f_split_rotary, PackedFunc f_copy_single_page) {
      // Optional<PackedFunc> f_debug_get_kv
      CHECK_EQ(cache_config.size(), 5);
      int64_t reserved_num_seqs = cache_config[0];
      int64_t total_token_capacity = cache_config[1];
      int64_t prefill_chunk_size = cache_config[2];
      int64_t page_size = cache_config[3];
      bool support_sliding_window = cache_config[4];
      int64_t num_total_pages = (total_token_capacity + page_size - 1) / page_size + 1;
      if (support_sliding_window) {
        // When sliding window is enabled, each sequence may use two more pages at most.
        num_total_pages += reserved_num_seqs * 2;
      }
      ObjectPtr<PagedAttentionKVCacheObj> n = make_object<PagedAttentionKVCacheObj>(
          page_size, num_layers, num_qo_heads, num_kv_heads, head_dim, reserved_num_seqs,
          num_total_pages, prefill_chunk_size, support_sliding_window, RoPEMode(rope_mode),
          rotary_scale, rotary_theta, init->dtype, init->device, std::move(f_transpose_append),
          std::move(f_attention_prefill), std::move(f_attention_decode),
          std::move(f_attention_prefill_sliding_window),
          std::move(f_attention_decode_sliding_window), std::move(f_attention_prefill_ragged),
          // NullOpt, NullOpt, NullOpt, NullOpt, NullOpt, NullOpt,
          std::move(f_merge_inplace), std::move(f_split_rotary), std::move(f_copy_single_page),
          NullOpt);  // std::move(f_debug_get_kv));
      return AttentionKVCache(std::move(n));
    });
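If anyone else wants to try this patch: the change only takes effect once the runtime is rebuilt from the modified 3rdparty/tvm sources. For the Android flow discussed in this thread that means re-running the library build, roughly (a sketch, assuming the repo layout used above):

cd android/library
./prepare_libs.sh   # recompiles the native libs, picking up the patched paged_kv_cache.cc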
https://github.com/mlc-ai/mlc-llm/issues/2076#issuecomment-2056249208 works fine for me, thanks!
Update:
It seems this only works with the prebuilt libs. When I compile with customized configurations, the issue still exists. I'm using the latest mlc-llm; both the prebuilt and the built-from-source versions failed. So there might be something wrong on the compilation side too; could you confirm?
Not sure if this is what @xxxxyu meant, but for me:
- the compiler for tvm wasn't able to infer the types until I changed the call to make_object<PagedAttentionKVCacheObj>
- when I ran ./prepare_libs.sh, I got:
In file included from /home/sygi/code/mlc-llm2/cpp/llm_chat.cc:26:
/home/sygi/code/mlc-llm2/cpp/./metadata/model.h:18:7: error: typedef redefinition with different types ('std::unordered_map<std::string, value>' (aka 'unordered_map<basic_string<char>, picojson::value>') vs 'value::object' (aka 'picojson::object_with_ordered_keys'))
using object = std::unordered_map<std::string, value>;
^
/home/sygi/code/mlc-llm2/3rdparty/tvm/3rdparty/picojson/picojson.h:326:23: note: previous definition is here
typedef value::object object;
Hi @sygi
- Regarding the make_object type error, I've fixed it the same way you did to make it work.
- I didn't encounter any other issue at compile time, including the redefinition error you mentioned. You might need to check whether the brackets still match after replacing the code.
The error I mentioned in the update shows up only at runtime, and only when I attempt to compile the model libs on my own (the prebuilt model libs on https://github.com/mlc-ai/binary-mlc-llm-libs are OK); the error message is still "...relax.vm.AttentionKVCache expects 19 arguments, but 18 were provided".
Thank you for confirming. To clarify, my error doesn't appear during compilation of tvm (which runs fine), but when running ./prepare_libs.sh from the android folder (presumably during linking with picojson). Did you also do this? Could you confirm that you're at commit 0f67508 of the tvm submodule?
Fwiw, I only want to use the prebuilt libraries; compiling them yourself seems like another ordeal ^^
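One way to check the submodule commit mentioned above (a small sketch; run from the root of the mlc-llm checkout):

git -C 3rdparty/tvm rev-parse --short HEAD   # should start with 0f67508 if the submodule matches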
@sygi
I'm using mlc-ai-nightly==0.15.dev275, which should contain a prebuilt tvm package. I didn't compile tvm myself either, I only attempted to compile mlc-llm myself.
python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))" gives: GIT_COMMIT_HASH: ae057a2e74e895a846df958c19ff342505131a65.
./prepare_libs.sh was successful for me.
BTW, I've noticed a small difference between our code: you might want to try ObjectPtr<PagedAttentionKVCacheObj> n = make_object<PagedAttentionKVCacheObj>? But I'm not sure if this matters :)
The latest Android SDK flow might help address related issues: https://llm.mlc.ai/docs/deploy/android.html
Hit the same issue with the latest code at 9998076153d5309ec87dc32c373e1759813ee84e for the iOS app with a customized llama model:
Function vm.builtin.paged_attention_kv_cache_create_reduced(0: runtime.ShapeTuple, 1: int64_t, 2: int64_t, 3: int64_t, 4: int64_t, 5: int, 6: double, 7: double, 8: runtime.NDArray, 9: runtime.PackedFunc, 10: runtime.PackedFunc, 11: runtime.PackedFunc, 12: runtime.PackedFunc, 13: runtime.PackedFunc, 14: runtime.PackedFunc, 15: runtime.PackedFunc, 16: runtime.PackedFunc, 17: runtime.PackedFunc, 18: runtime.PackedFunc) -> relax.vm.AttentionKVCache expects 19 arguments, but 18 were provided.