TypeError when calling glm-4-9b-chat (cannot use a string pattern on a bytes-like object)
Describe the issue as clearly as possible:
When I use outlines with glm-4-9b-chat for a classification task, I get the error "cannot use a string pattern on a bytes-like object".
Steps/code to reproduce the bug:
from outlines import models, generate
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "glm-4-9b-chat"
llm = AutoModelForCausalLM.from_pretrained(f"/datas/huggingface/{model_name}", trust_remote_code=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(f"/datas/huggingface/{model_name}", trust_remote_code=True)
model = models.Transformers(llm, tokenizer)
generator = generate.choice(model, ["positive", "negative"])
Expected result:
positive OR negative
Error message:
Traceback (most recent call last):
File "/datas/wangm/seeker_status/test_classification_glm.py", line 17, in <module>
generator = generate.choice(model,["positive","negative"])
File "/datas/wangm/.conda/envs/llama/lib/python3.10/functools.py", line 889, in wrapper
return dispatch(args[0].__class__)(*args, **kw)
File "/datas/wangm/.conda/envs/llama/lib/python3.10/site-packages/outlines/generate/choice.py", line 17, in choice
generator = regex(model, regex_str, sampler)
File "/datas/wangm/.conda/envs/llama/lib/python3.10/functools.py", line 889, in wrapper
return dispatch(args[0].__class__)(*args, **kw)
File "/datas/wangm/.conda/envs/llama/lib/python3.10/site-packages/outlines/generate/regex.py", line 33, in regex
fsm = RegexGuide(regex_str, model.tokenizer)
File "/datas/wangm/.conda/envs/llama/lib/python3.10/site-packages/outlines/fsm/guide.py", line 145, in __init__
) = create_states_mapping(regex_string, tokenizer)
File "/datas/wangm/.conda/envs/llama/lib/python3.10/site-packages/outlines/caching.py", line 122, in wrapper
result = cached_function(*args, **kwargs)
File "/datas/wangm/.conda/envs/llama/lib/python3.10/site-packages/outlines/fsm/guide.py", line 118, in create_states_mapping
states_to_token_maps, empty_token_ids = create_fsm_index_tokenizer(
File "/datas/wangm/.conda/envs/llama/lib/python3.10/site-packages/outlines/fsm/regex.py", line 898, in create_fsm_index_tokenizer
vocabulary, empty_token_ids = reduced_vocabulary(tokenizer)
File "/datas/wangm/.conda/envs/llama/lib/python3.10/site-packages/outlines/fsm/regex.py", line 846, in reduced_vocabulary
if "\ufffd" in token_str and not re_replacement_seq.match(token):
TypeError: cannot use a string pattern on a bytes-like object
Outlines/Python version information:
Context for the issue:
In my testing, the error occurs on the generator = generate.choice(model, ["positive", "negative"]) line.
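For reference, the TypeError itself is easy to reproduce outside outlines: Python's re module raises it whenever a pattern compiled from a str is matched against bytes, which is what the check in outlines/fsm/regex.py from the traceback runs into. A minimal sketch (the pattern here is a stand-in, not outlines' actual re_replacement_seq):

import re

pattern = re.compile("\ufffd")   # a str pattern, like outlines' re_replacement_seq
token = b"\xe4\xbd\xa0"          # a bytes token, as the GLM tokenizer hands back
try:
    pattern.match(token)
except TypeError as e:
    print(e)                     # cannot use a string pattern on a bytes-like object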
Hi,
This is also observed with the regex generator for this model (outlines.generate.regex(model, decoding_regex, sampler=sampler)). @rlouf it would be great if you could look into this. Thanks as always for your great contribution.
The same problem occurs with Qwen family models.
The same problem occurs in vLLM with "response_format": {"type": "json_object"}.
Anyone have a workaround?
I had the same problem with a glm-4-9b model!
@XxxAtlantis any luck in figuring out a workaround?
I got the same error and dug into the code and the GLM tokenizer a bit. I believe the direct cause is that the GLM tokenizer does not strictly adhere to the definition of get_vocab() from the Transformers library. According to the documentation (https://huggingface.co/docs/transformers/v4.44.0/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.get_vocab), get_vocab() is supposed to return the vocabulary as strings, and Outlines uses string utilities like startswith to process the model's vocabulary and construct its FSM. However, the GLM tokenizer's get_vocab() returns the vocabulary as bytes, which is incompatible with those string utilities.
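A quick way to see this (a minimal sketch, assuming the model is loaded from the THUDM/glm-4-9b-chat Hub repo) is to check the types of the keys that get_vocab() returns:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)
vocab = tokenizer.get_vocab()

# Most tokenizers return only str keys here; per the above, GLM's vocabulary comes back with bytes keys.
print({type(k) for k in vocab})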
Simply converting the byte-based vocabulary to strings will not fix the issue. I tried this approach and discovered that parts of the GLM vocabulary cannot be converted to UTF-8 strings. It turns out that Qwen and GLM use byte-level Byte Pair Encoding (BPE), and some individual tokens are not valid UTF-8 sequences. More information can be found here: https://github.com/QwenLM/Qwen/blob/main/tokenization_note.md
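To illustrate why decoding alone is not enough (a self-contained sketch, not tied to any particular tokenizer): a byte-level BPE token can end in the middle of a multi-byte UTF-8 character, and such a fragment cannot be decoded on its own.

# b"\xe4\xbd" is the first two bytes of "你" (b"\xe4\xbd\xa0"); decoding the fragment alone fails.
fragment = b"\xe4\xbd"
try:
    fragment.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)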
In conclusion:
The error can only be resolved by having Outlines support BPE byte tokens. Chinese language models like InternLM, which do not use BPE, are more compatible with Outlines.
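As a rough illustration of what such support could look like (a sketch only, not Outlines' actual implementation), one option is to normalize every vocabulary entry to bytes so that FSM construction works at the byte level regardless of what the tokenizer returns:

def normalize_vocab(vocab):
    """Map every token to bytes so str- and bytes-keyed vocabularies are handled uniformly."""
    normalized = {}
    for token, token_id in vocab.items():
        if isinstance(token, str):
            token = token.encode("utf-8")  # str tokens become their UTF-8 byte form
        normalized[token] = token_id       # bytes tokens pass through unchanged
    return normalized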
I have a simple fix for the issue.
It seems glm-4-9b-chat's tokenizer uses BPE slightly differently from LLaMA-style BPE. Specifically, glm-4-9b's tokenizer works with bytes directly rather than padding and converting them to a string.
from outlines import models, generate
from transformers import AutoModelForCausalLM, AutoTokenizer

llm = AutoModelForCausalLM.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)
model = models.Transformers(llm, tokenizer)
generator = generate.choice(model, ['和长', '本代模'])
input = '我们在一些经典任务上对 GLM-4-9B-Chat 模型进行了评测,并得到了如下的结果'
print(generator(input))
# '和长'
Could you please help me test this PR and ensure it resolves the issues you all have been seeing?
Preview:
pip uninstall -y outlines
pip install --upgrade git+https://github.com/lapp0/outlines@fix-bpe
I tried this version of the package, but I got the following error:
File "<frozen importlib._bootstrap>", line 571, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1176, in create_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: libharpcuda.so.0: cannot open shared object file: No such file or directory
@sci-m-wang can you provide the full traceback? I see a similar error relating to triton in the Unsloth repo https://github.com/unslothai/unsloth/issues/872
@sci-m-wang you can suppress Dynamo errors by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
If this doesn't work, try disabling the dynamic graph as well.
import torch._dynamo
torch._dynamo.config.disable = True