Trouble loading Mistral via transformers
The bug
On a freshly created conda environment, attempting to load Mistral-7B via Hugging Face transformers fails.
To Reproduce
This is based on PR #741:

```
git checkout riedgar-ms/enable-transformers-7b
conda create -n guidance-312 python=3.12
conda activate guidance-312
pip install -e .[test]
pip install accelerate llama-cpp-python
python -m pytest --selected_model transformers_mistral_7b_gpu .\tests\library\test_gen.py::test_various_regexes
```
I wind up with errors:
```
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests\utils.py:19: in get_model
    return get_transformers_model(model_name[13:], caching, **kwargs)
tests\utils.py:60: in get_transformers_model
    transformers_model_cache[key] = guidance.models.Transformers(
guidance\models\transformers\_transformers.py:209: in __init__
    TransformersEngine(model, tokenizer, compute_log_probs, **kwargs),
guidance\models\transformers\_transformers.py:114: in __init__
    TransformersTokenizer(model, tokenizer),
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <guidance.models.transformers._transformers.TransformersTokenizer object at 0x000001741467D130>, model = 'mistralai/Mistral-7B-v0.1'
tokenizer = LlamaTokenizerFast(name_or_path='mistralai/Mistral-7B-v0.1', vocab_size=32000, model_max_length=1000000000000000019884...special=True),
    2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
ignore_bos_token = False

    def __init__(self, model, tokenizer, ignore_bos_token=False):
        if tokenizer is None:
            tokenizer = self._tokenizer(model)
        self._orig_tokenizer = tokenizer
        # build out the set of byte_string tokens
        byte_tokens = []
        if hasattr(tokenizer, "byte_decoder"):
            byte_decoder = tokenizer.byte_decoder
            for i in range(len(tokenizer)):
                byte_coded = bytes([byte_decoder[c] for c in tokenizer.convert_ids_to_tokens(i)])
                byte_tokens.append(byte_coded)
        elif hasattr(tokenizer, "sp_model"):
            space_prefix = '▁'.encode()
            for i in range(len(tokenizer)):
                byte_coded = re.sub(br'<0x(..)>', lambda x: bytes.fromhex(x[1].decode()), tokenizer.sp_model.id_to_piece(i).encode())
                byte_tokens.append(byte_coded.replace(space_prefix, b" "))
        else:
            import transformers
            byte_decoder = transformers.AutoTokenizer.from_pretrained("gpt2", use_fast=False).byte_decoder  # fall back to gpt2 mapping
            # some special tokens may not have their whitespace encoded...
            byte_decoder[' '] = 32
            byte_decoder['\n'] = 10
            byte_decoder['\r'] = 13
            byte_decoder['\t'] = 9
            byte_decoder['▁'] = 32
            # run a quick spot check to verify we can rebuild complex multi-token unicode symbols
            s = "’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"
            t = tokenizer
            reconstructed = b''
            for id in t(s)["input_ids"]:
>               reconstructed += bytes([byte_decoder[c] for c in t.convert_ids_to_tokens(id)])
E               KeyError: '’'
```
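For reference, the quoted __init__ picks one of three decoding strategies; roughly (a paraphrase of the code above, not a guidance API):

```python
def tokenizer_branch(tokenizer):
    """Which branch of TransformersTokenizer.__init__ a tokenizer will take."""
    if hasattr(tokenizer, "byte_decoder"):
        return "byte_decoder"   # byte-level BPE tokenizers, e.g. the slow GPT-2
    if hasattr(tokenizer, "sp_model"):
        return "sp_model"       # slow sentencepiece tokenizers, e.g. LlamaTokenizer
    return "gpt2 fallback"      # everything else, including LlamaTokenizerFast here
```

LlamaTokenizerFast has neither attribute, so it lands in the gpt2 fallback, which is where the KeyError is raised.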
System info (please complete the following information):
- OS (e.g. Ubuntu, Windows 11, Mac OS, etc.): Win11
- Guidance Version (`guidance.__version__`): synced fork
Note that, AFAICT, the actions on the PR are being OOM-killed (or running out of disk space). However, that's another problem.
In my case, this is because the Mistral tokenizer fell back to the fast tokenizer, which left sp_model missing; installing sentencepiece solved it for me.
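Roughly what that fallback looks like (a sketch; the exact loading logic in guidance may differ slightly):

```python
from transformers import AutoTokenizer

try:
    # the slow LlamaTokenizer needs the sentencepiece package for Mistral
    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", use_fast=False)
except Exception:
    # without sentencepiece this raises, so the fast tokenizer gets used instead
    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# only the slow (sentencepiece) tokenizer exposes sp_model, the attribute that
# TransformersTokenizer checks before falling through to the gpt2 branch
print(type(tok).__name__, hasattr(tok, "sp_model"))
```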
But then I get an error during token cleanup, so I modified it like this:
```python
# ugly hack to deal with sentence piece craziness of space hiding after special tokens
# TODO: figure out how to make this more robust
diff = token_byte_positions[-1] - last_pos
if diff > 0:
    for _ in range(diff):
        if self.tokenizer.tokens[token_ids[0]] == b'<s>' \
                and self.tokenizer.tokens[token_ids[1]][0:1] == b' ':
            for i in range(1, len(token_byte_positions)):
                token_byte_positions[i] -= 1
assert token_byte_positions[-1] == last_pos
```
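For context, here is a small sketch of the sentencepiece quirk that hack works around (assumes sentencepiece is installed so the slow Mistral tokenizer loads):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", use_fast=False)
ids = tok("hello")["input_ids"]
print(tok.convert_ids_to_tokens(ids))  # ['<s>', '▁hello']
# the piece after <s> carries a leading '▁' (decoded as a space byte) even though
# "hello" has no leading space, so byte positions derived from token bytes can
# drift by one relative to the prompt string
```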
Hmmm.... adding sentencepiece to my pip installs is at least allowing my tests to get further. However, things are running a bit slowly, and I don't know if they will succeed yet.
Forgive me if I'm wrong, but the problem occurs because the default gpt2 byte encoder doesn't cover all Unicode characters.
These are the byte ranges from bytes_to_unicode, which the gpt2 byte_encoder is built from:

```python
list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
```
And from the GPT2Tokenizer __init__ function:

```python
self.byte_encoder = bytes_to_unicode()
self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
```
The string ’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨ fails since it contains characters that are not in that list.
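You can check this directly (a quick sketch; bytes_to_unicode is the real helper in transformers' gpt2 module):

```python
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

byte_decoder = {v: k for k, v in bytes_to_unicode().items()}
print(len(byte_decoder))    # 256 -- exactly one unicode character per byte value
print('’' in byte_decoder)  # False -> the KeyError in the traceback above
```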
So, the question is, is it necessary to check this string?
The assert on that string should definitely be moved to a separate test. That might let some things work, but the underlying problem would still remain: the model can't cope with some valid Unicode strings.
I think this should just give a warning instead. The original issue with Mistral can already be solved by installing sentencepiece. The gpt2 fallback is already a worst-case scenario, right? And realistically, it's not possible to support every model out there. Just give a warning that the model has no byte decoder, or some other message to inform the user.
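Something along these lines (a sketch only; the function name and signature are invented for illustration, reusing the variable names from the quoted guidance code):

```python
import warnings

def spot_check_byte_decoder(tokenizer, byte_decoder, probe="’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"):
    """Warn instead of crashing when the fallback byte_decoder fails the round-trip."""
    try:
        reconstructed = b""
        for token_id in tokenizer(probe)["input_ids"]:
            reconstructed += bytes(
                byte_decoder[c] for c in tokenizer.convert_ids_to_tokens(token_id)
            )
    except KeyError:
        warnings.warn(
            "This tokenizer has no byte_decoder or sp_model, and the gpt2 fallback "
            "cannot represent all of its tokens; some output may decode incorrectly."
        )
```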
Hey @yonitjio, I don't understand why installing sentencepiece would solve this problem. According to the code, it seems like it would still go to the branch that uses gpt2?
If you don't install sentencepiece, the tokenizer will fall back to the fast tokenizer, which doesn't have sp_model. See here: https://github.com/guidance-ai/guidance/blob/e234c565b61ffb90dbbf81cd937a00505ef79649/guidance/models/transformers/_transformers.py#L99
I understand what you mean, but I'm currently using BloomTokenizer, and with it I can only set use_fast=True, because only the file tokenization_bloom_fast.py exists. As a result, the tokenizer I get has neither the byte_decoder nor the sp_model attribute. My guess is that all fast tokenizers share gpt2's mapping between bytes and unicode, so gpt2's byte_decoder can be used as a substitute.
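One way to test that guess for a particular fast tokenizer (a sketch; bloom-560m is just an example checkpoint):

```python
from transformers import AutoTokenizer
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

byte_decoder = {v: k for k, v in bytes_to_unicode().items()}
tok = AutoTokenizer.from_pretrained("bigscience/bloom-560m")  # fast-only tokenizer

failures = 0
for i in range(len(tok)):
    try:
        bytes(byte_decoder[c] for c in tok.convert_ids_to_tokens(i))
    except KeyError:
        failures += 1  # token contains a character outside the gpt2 byte mapping
print(f"{failures} of {len(tok)} tokens are not decodable via the gpt2 mapping")
```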
I suppose so. But as I said before, I don't think it's realistic to support every model out there (for now?). The only other option I can think of, besides warning the user, is to allow a custom function for this.
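For example, something like this (purely hypothetical; guidance has no such hook, and byte_string_fn is an invented name):

```python
def build_byte_tokens(tokenizer, byte_string_fn):
    """Map every token id to its byte string using a caller-supplied function,
    for tokenizers that expose neither byte_decoder nor sp_model."""
    return [byte_string_fn(tokenizer, i) for i in range(len(tokenizer))]

# example: treat each token's string form as UTF-8 (only correct for some models)
# byte_tokens = build_byte_tokens(tok, lambda t, i: t.convert_ids_to_tokens(i).encode("utf-8"))
```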