[Bug] chatglm4 mlc_llm shows error "TVMError: Check failed: append_length > 0 (0 vs. 0) : Append with length 0 is not allowed." when running the mlc_llm chat CLI
Environment:
mlc-ai-nightly-cu122 0.15.dev404
mlc-llm-nightly-cu122 0.1.dev1355
transformers 4.41.2
git clone https://huggingface.co/THUDM/glm-4-9b-chat
mlc_llm convert_weight ./dist/models/glm-4-9b-chat/ --quantization q4f16_1 -o dist/glm-4-9b-chat-MLC
mlc_llm gen_config ./dist/models/glm-4-9b-chat/ --quantization q4f16_1 --conv-template glm -o dist/glm-4-9b-chat-MLC/
Running gen_config shows:
The repository for dist/models/glm-4-9b-chat contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/dist/models/glm-4-9b-chat.
You can avoid this prompt in future by passing the argument trust_remote_code=True.
Do you wish to run the custom code? [y/N] y
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
After adding trust_remote_code=True to the tokenizer loading call
fast_tokenizer = AutoTokenizer.from_pretrained(str(config.parent), use_fast=True, trust_remote_code=True)
it fails with:
AttributeError: 'ChatGLM4Tokenizer' object has no attribute 'backend_tokenizer'
/workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:154: Warning: Tokenizer info is not detected as tokenizer.json is not found. The default tokenizer info will be used.
Segmentation fault (core dumped)
The segmentation fault happens at mlc_chat_config.tokenizer_info = asdict(Tokenizer.detect_tokenizer_info(str(output))).
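For what it's worth, a minimal check outside of MLC (a sketch that only assumes the local clone path from the commands above and an installed transformers) seems to confirm that the GLM-4 tokenizer has no fast backend to export tokenizer.json from:

```python
from transformers import AutoTokenizer

model_dir = "./dist/models/glm-4-9b-chat"  # local clone from the steps above

# GLM-4 ships a custom tokenizer class, so trust_remote_code is required.
tok = AutoTokenizer.from_pretrained(model_dir, use_fast=True, trust_remote_code=True)

# ChatGLM4Tokenizer is a "slow" tokenizer: there is no Rust backend behind it,
# so it has no backend_tokenizer attribute and no tokenizer.json to export,
# which matches the AttributeError above.
print(type(tok).__name__)                  # expected: ChatGLM4Tokenizer
print(tok.is_fast)                         # expected: False
print(hasattr(tok, "backend_tokenizer"))   # expected: False
```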
I'm not sure, but GLM may use a customized tokenizer that is not supported yet.
https://github.com/mlc-ai/mlc-llm/pull/1313 mentioned bringing chatglm3 back, but I tried chatglm3-6b and it shows the same error.
That is related to a recent change to the tokenizer in #2416. We will fix it soon.
See #2532
@Ubospica thanks! I just tested the latest packages: mlc-ai-nightly-cu122 0.15.dev404, mlc-llm-nightly-cu122 0.1.dev1382
mlc_llm gen_config ./dist/models/glm-4-9b-chat/ --quantization q4f16_1 --conv-template glm -o dist/glm-4-9b-chat-MLC/
works now. But after compiling with
mlc_llm compile ./dist/glm-4-9b-chat-MLC/mlc-chat-config.json --device cuda --quantization q4f16_1 --model-type chatglm --output ./dist/libs/glm-4-9b-chat/glm-4-9b-chat-cuda.so
and running the chat CLI with
mlc_llm chat ./dist/glm-4-9b-chat-MLC/ --device "cuda" --model-lib dist/libs/glm-4-9b-chat/glm-4-9b-chat-cuda.so
it shows an error like the one below:
mlc_llm chat ./dist/glm-4-9b-chat-MLC/ --device "cuda" --model-lib dist/libs/glm-4-9b-chat/glm-4-9b-chat-cuda.so
[2024-06-08 07:04:34] INFO auto_device.py:79: Found device: cuda:0
[2024-06-08 07:04:34] INFO engine_base.py:143: Using library model: dist/libs/glm-4-9b-chat/glm-4-9b-chat-cuda.so
[07:04:34] /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:154: Warning: Tokenizer info is not detected as tokenizer.json is not found. The default tokenizer info will be used.
[07:04:34] /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:130: Warning: Using tokenizer.model since we cannot locate tokenizer.json.
It is recommended to use tokenizer.json to ensure all token mappings are included, since currently, files like added_tokens.json, tokenizer_config.json are ignored.
Consider converting tokenizer.model to tokenizer.json by compiling the model with MLC again, or see if MLC's huggingface provides this file.
[07:04:34] /workspace/mlc-llm/cpp/serve/config.cc:649: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 2048.
[07:04:34] /workspace/mlc-llm/cpp/serve/config.cc:649: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 131072, prefill chunk size will be set to 2048.
[07:04:34] /workspace/mlc-llm/cpp/serve/config.cc:649: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 314995, prefill chunk size will be set to 2048.
[07:04:34] /workspace/mlc-llm/cpp/serve/config.cc:729: The actual engine mode is "interactive". So max batch size is 1, max KV cache token capacity is 131072, prefill chunk size is 2048.
[07:04:34] /workspace/mlc-llm/cpp/serve/config.cc:734: Estimated total single GPU memory usage: 13215.253 MB (Parameters: 5043.234 MB. KVCache: 5202.672 MB. Temporary buffer: 2969.346 MB). The actual usage might be slightly larger than the estimated number.
[07:04:36] /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:154: Warning: Tokenizer info is not detected as tokenizer.json is not found. The default tokenizer info will be used.
[07:04:36] /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:130: Warning: Using tokenizer.model since we cannot locate tokenizer.json.
It is recommended to use tokenizer.json to ensure all token mappings are included, since currently, files like added_tokens.json, tokenizer_config.json are ignored.
Consider converting tokenizer.model to tokenizer.json by compiling the model with MLC again, or see if MLC's huggingface provides this file.
sentencepiece_processor.cc(922) LOG(ERROR) 3rdparty/tokenizers-cpp/sentencepiece/src/sentencepiece_processor.cc(289) [model_] Model is not initialized.
Returns default value 0
[07:04:36] /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:154: Warning: Tokenizer info is not detected as tokenizer.json is not found. The default tokenizer info will be used.
[07:04:36] /workspace/mlc-llm/cpp/tokenizers/tokenizers.cc:130: Warning: Using tokenizer.model since we cannot locate tokenizer.json.
It is recommended to use tokenizer.json to ensure all token mappings are included, since currently, files like added_tokens.json, tokenizer_config.json are ignored.
Consider converting tokenizer.model to tokenizer.json by compiling the model with MLC again, or see if MLC's huggingface provides this file.
You can use the following special commands:
/help print the special commands
/exit quit the cli
/stats print out stats of last request (token/sec)
/metrics print out full engine metrics
/reset restart a fresh chat
/set [overrides] override settings in the generation config. For example,
/set temperature=0.5;top_p=0.8;seed=23;max_tokens=100;stop=str1,str2
Note: Separate stop words in the stop option with commas (,).
Multi-line input: Use escape+enter to start a new line.
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/haoli/anaconda3.11-GPU_new_mlc2/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
self.run()
File "/home/haoli/anaconda3.11-GPU_new_mlc2/lib/python3.11/threading.py", line 975, in run
self._target(*self._args, **self._kwargs)
File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
File "/home/haoli/anaconda3.11-GPU_new_mlc2/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
File "/workspace/mlc-llm/cpp/serve/threaded_engine.cc", line 182, in mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()
File "/workspace/mlc-llm/cpp/serve/engine.cc", line 619, in mlc::llm::serve::EngineImpl::Step()
File "/workspace/mlc-llm/cpp/serve/engine_actions/new_request_prefill.cc", line 116, in mlc::llm::serve::NewRequestPrefillActionObj::Step(mlc::llm::serve::EngineState)
File "/workspace/mlc-llm/cpp/serve/model.cc", line 232, in mlc::llm::serve::ModelImpl::BatchPrefill(tvm::runtime::ObjectRef const&, std::vector<long, std::allocator
@lihaofd thanks for reporting. We'll look into the issue on glm 4. Meanwhile, would you mind confirming if this issue does not happen for other models like llama?
@MasterJH5574 I have tried chatglm3-6b. It does not show the error "TVMError: Check failed: append_length > 0 (0 vs. 0) : Append with length 0 is not allowed.", but the output is abnormal. For example, with
mlc_llm chat ./dist/chatglm3-6b-MLC/ --device "cuda" --model-lib dist/libs/chatglm3-6b/chatglm3-6b-cuda.so
please introduce shanghai
My name is xxx, and I am a school school
- I- q is a 在校学生 .进行的进行的
I am a language model,
And here's largest
No 'big' what The speech speech
懈口令公 quo I'm wrong
您傘
您遮 Aut兼任
quality
语言
@MasterJH5574 I also tried https://huggingface.co/shenzhi-wang/Llama3-8B-Chinese-Chat. The CLI below seems to work with normal output:
mlc_llm chat ./dist/Llama3-8B-Chinese-Chat-MLC/ --device "cuda" --model-lib ./dist/libs/Llama3-8B-Chinese-Chat/Llama3-8B-Chinese-Chat-q4f16_1-cuda.so
introduce shanghai
Shanghai, the "Pearl of the Orient," is an iconic metropolis that effortlessly blends traditional Chinese culture and history with its modern, cosmopolitan flair. Located on the eastern coast of China, this vibrant city is one of the most populous and culturally significant urban centers in the world.
As part of the Yangtze River Delta, Shanghai has been an essential trading point for centuries, eventually emerging as a major political and economic hub in the People's Republic of China. It's home to a magnitude of historical, cultural, and architectural marvels that span across eras and styles – the modern skyscrapers of the stunning skyline aptly complemented by tranquil Chinese pavilions and gardens.
From world-renowned attractions like the Shanghai Tower and the iconic Bund, to the restored classical charm of the old French Concession and the beckoning call of sweet juicy sheng jian bing (crispy-skinned pancake) and a steaming pot of xiaolongbao (soup-filled dumplings), this city is a must-visit destination for anyone eager to explore the unique dynamism of contemporary China. With its people, food, and riveting history, Shanghai is sure to meet and, likely, exceed the expectations of curious and keen travelers.
@lihaofd Thanks for sharing so much information. We'll look into this.
internlm2 hits this error as well, with v0.9.dev0/mlc_ai_nightly_cu122-0.15.dev519-cp310-cp310-manylinux_2_28_x86_64.whl
hello, have you fixed this bug? @lihaofd