
Trouble with loading mistral via transformers

Open riedgar-ms opened this issue 1 year ago • 10 comments

The bug

On a freshly created conda environment, attempting to load mistral-7b via Hugging Face fails.

To Reproduce

This is based on PR #741

git checkout riedgar-ms/enable-transformers-7b

conda create -n guidance-312 python=3.12
conda activate guidance-312

pip install -e .[test]
pip install accelerate llama-cpp-python

python -m pytest --selected_model transformers_mistral_7b_gpu .\tests\library\test_gen.py::test_various_regexes

I wind up with errors:

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests\utils.py:19: in get_model
    return get_transformers_model(model_name[13:], caching, **kwargs)
tests\utils.py:60: in get_transformers_model
    transformers_model_cache[key] = guidance.models.Transformers(
guidance\models\transformers\_transformers.py:209: in __init__
    TransformersEngine(model, tokenizer, compute_log_probs, **kwargs),
guidance\models\transformers\_transformers.py:114: in __init__
    TransformersTokenizer(model, tokenizer),
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <guidance.models.transformers._transformers.TransformersTokenizer object at 0x000001741467D130>, model = 'mistralai/Mistral-7B-v0.1'
tokenizer = LlamaTokenizerFast(name_or_path='mistralai/Mistral-7B-v0.1', vocab_size=32000, model_max_length=1000000000000000019884...special=True),
        2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
ignore_bos_token = False

    def __init__(self, model, tokenizer, ignore_bos_token=False):
        if tokenizer is None:
            tokenizer = self._tokenizer(model)


        self._orig_tokenizer = tokenizer

        # build out the set of byte_string tokens
        byte_tokens = []
        if hasattr(tokenizer, "byte_decoder"):
            byte_decoder = tokenizer.byte_decoder

            for i in range(len(tokenizer)):
                byte_coded = bytes([byte_decoder[c] for c in tokenizer.convert_ids_to_tokens(i)])
                byte_tokens.append(byte_coded)

        elif hasattr(tokenizer, "sp_model"):
            space_prefix = '▁'.encode()
            for i in range(len(tokenizer)):
                byte_coded = re.sub(br'<0x(..)>', lambda x: bytes.fromhex(x[1].decode()), tokenizer.sp_model.id_to_piece(i).encode())
                byte_tokens.append(byte_coded.replace(space_prefix, b" "))

        else:
            import transformers
            byte_decoder = transformers.AutoTokenizer.from_pretrained("gpt2", use_fast=False).byte_decoder # fall back to gpt2 mapping

            # some special tokens may not have their whitespace encoded...
            byte_decoder[' '] = 32
            byte_decoder['\n'] = 10
            byte_decoder['\r'] = 13
            byte_decoder['\t'] = 9
            byte_decoder['▁'] = 32

            # run a quick spot check to verify we can rebuild complex multi-token unicode symbols
            s = "’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"
            t = tokenizer
            reconstructed = b''
            for id in t(s)["input_ids"]:
>               reconstructed += bytes([byte_decoder[c] for c in t.convert_ids_to_tokens(id)])
E               KeyError: '’'

System info (please complete the following information):

  • OS (e.g. Ubuntu, Windows 11, Mac OS, etc.): Win11
  • Guidance Version (guidance.__version__): Synced fork

riedgar-ms · Apr 03 '24 15:04

Note that, AFAICT, the actions on the PR are being killed by OOM (or running out of disk space). However, that's a separate problem.

riedgar-ms · Apr 03 '24 16:04

In my case, this was because the Mistral tokenizer fell back to the fast tokenizer, which meant sp_model was missing; installing sentencepiece solved it for me.
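For reference, a minimal way to confirm that the slow (sentencepiece-backed) tokenizer is the one being loaded; this is just a sketch, and the explicit use_fast=False flag here is only for the check:

    # Sketch: with sentencepiece installed, the slow tokenizer loads and exposes
    # sp_model, which is the attribute guidance's TransformersTokenizer looks for.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", use_fast=False)
    print(type(tok).__name__)        # expect LlamaTokenizer (slow), not LlamaTokenizerFast
    print(hasattr(tok, "sp_model"))  # expect True, so the sp_model branch is taken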

But then I got an error during the token cleanup step.

So I modified that code like this:

            # ugly hack to deal with sentencepiece craziness of spaces hiding after special tokens
            # TODO: figure out how to make this more robust
            diff = token_byte_positions[-1] - last_pos
            if diff > 0:
                # when a space gets absorbed after the leading <s>, walk the byte
                # positions back until they line up with the prompt again
                for _ in range(diff):
                    if self.tokenizer.tokens[token_ids[0]] == b'<s>' \
                        and self.tokenizer.tokens[token_ids[1]][0:1] == b' ':
                        for i in range(1, len(token_byte_positions)):
                            token_byte_positions[i] -= 1
            assert token_byte_positions[-1] == last_pos

yonitjio · Apr 07 '24 16:04

Hmmm.... adding sentencepiece to my pip installs is at least allowing my tests to get further. However, things are running a bit slowly, and I don't know if they will succeed yet.

riedgar-ms · Apr 16 '24 14:04

Forgive me if I'm wrong,

The problem occurs because the default gpt2 byte encoder doesn't cover all Unicode characters.

This is from gpt2's byte_encoder, as built by bytes_to_unicode: list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))

From the GPT2Tokenizer __init__ function:

        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}

The string ’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨ fails because it contains characters that are not in that mapping.
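The gap is easy to see in isolation; a small sketch using the same gpt2 fallback mapping that the guidance code loads (the 256-entry count refers to the raw mapping, before guidance patches in the extra whitespace keys):

    # Sketch: the gpt2 byte_decoder only covers the 256 characters produced by
    # bytes_to_unicode(), so token strings containing anything else (e.g. '’'
    # as emitted by the fast Llama tokenizer) raise KeyError when looked up.
    import transformers

    byte_decoder = transformers.AutoTokenizer.from_pretrained("gpt2", use_fast=False).byte_decoder
    print(len(byte_decoder))    # 256 entries
    print('’' in byte_decoder)  # False -> byte_decoder['’'] is exactly the KeyError above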

So, the question is, is it necessary to check this string?

yonitjio · Apr 29 '24 03:04

The assert on that string should definitely be moved to a separate test. That might let some things work, but the underlying problem would still remain: the model can't cope with some valid unicode strings.
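Something along these lines, perhaps; a sketch only, where the fixture names (selected_hf_tokenizer, selected_byte_decoder) are hypothetical and not part of the project's actual test layout:

    # Hypothetical standalone test: run the same spot check that currently lives
    # in TransformersTokenizer.__init__, so a gap in the byte mapping shows up as
    # a test failure rather than breaking model construction.
    def test_byte_decoder_covers_unicode(selected_hf_tokenizer, selected_byte_decoder):
        s = "’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"
        reconstructed = b""
        for token_id in selected_hf_tokenizer(s)["input_ids"]:
            token = selected_hf_tokenizer.convert_ids_to_tokens(token_id)
            reconstructed += bytes([selected_byte_decoder[c] for c in token])
        assert reconstructed.decode().endswith(s)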

riedgar-ms · Apr 29 '24 12:04

I think this should just give a warning instead.

I mean, the original issue with Mistral can already be solved by installing sentencepiece.

The gpt2 fallback is already a worst-case scenario, right? And realistically, it's not possible to support every model out there.

Just give a warning that the model has no byte decoder, or some other message to inform the user.
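One possible shape for that, as a sketch only (the function name, the wording, and the decision to keep going with a possibly lossy mapping are all assumptions, not current guidance behaviour):

    # Sketch: replace the hard KeyError in the spot check with a warning and
    # carry on with the (possibly lossy) fallback byte mapping.
    import warnings

    def check_byte_decoder(tokenizer, byte_decoder, probe="’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨"):
        try:
            for token_id in tokenizer(probe)["input_ids"]:
                token = tokenizer.convert_ids_to_tokens(token_id)
                bytes([byte_decoder[c] for c in token])
        except KeyError as e:
            warnings.warn(
                f"The model has no byte decoder or sp_model, and the gpt2 fallback "
                f"mapping cannot represent token character {e.args[0]!r}; some "
                f"unicode output may be decoded incorrectly."
            )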

yonitjio · May 11 '24 10:05

sentencepiece

Hey @yonitjio, I don't understand why installing sentencepiece would solve this problem. According to the code, it seems like it would still go to the branch that uses gpt2?

LuoKaiGSW · Jun 04 '24 08:06

If you don't install sentencepiece, the tokenizer will fall back to the fast tokenizer, which doesn't have sp_model.

See here https://github.com/guidance-ai/guidance/blob/e234c565b61ffb90dbbf81cd937a00505ef79649/guidance/models/transformers/_transformers.py#L99
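To see which branch a given tokenizer object will end up in, you can mirror the hasattr checks from that __init__; a sketch:

    # Sketch: mirror the checks in TransformersTokenizer.__init__ to see which
    # byte-mapping strategy a given Hugging Face tokenizer will get.
    def which_branch(tokenizer) -> str:
        if hasattr(tokenizer, "byte_decoder"):
            return "byte_decoder branch"
        if hasattr(tokenizer, "sp_model"):
            return "sp_model branch (slow, sentencepiece-backed tokenizer)"
        return "gpt2 fallback branch (where the KeyError above comes from)"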

yonitjio · Jun 04 '24 09:06

I mean, the original issue with Mistral can already be solved by installing sentencepiece.

If you don't install sentencepiece, the tokenizer will fall back to the fast tokenizer, which doesn't have sp_model.

See here

https://github.com/guidance-ai/guidance/blob/e234c565b61ffb90dbbf81cd937a00505ef79649/guidance/models/transformers/_transformers.py#L99

I understand what you mean, but I'm currently using BloomTokenizer, and with it I can only set use_fast = True, because there is only a tokenization_bloom_fast.py. As a result, the tokenizer I get has neither the byte_decoder nor the sp_model attribute. My guess is that all fast tokenizers share gpt2's mapping between bytes and unicode, so gpt2's byte_decoder can be used as a substitute.
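If that guess holds, the mapping can also be built locally from bytes_to_unicode instead of downloading the gpt2 tokenizer just for its byte_decoder; a sketch (whether every fast tokenizer really uses this byte-level mapping is an assumption):

    # Sketch: construct the gpt2-style byte_decoder directly from the helper
    # that GPT2Tokenizer itself uses, rather than fetching the gpt2 tokenizer.
    from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

    byte_encoder = bytes_to_unicode()                       # byte value -> unicode char
    byte_decoder = {v: k for k, v in byte_encoder.items()}  # unicode char -> byte value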

LuoKaiGSW · Jun 04 '24 11:06

I suppose so.

But as I said before, I don't think it's realistic to support every model out there (for now?).

I can only think of one other option besides giving a warning to the user: allow a custom function for this.
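As a rough sketch of that idea (purely hypothetical; guidance does not currently expose such a hook, and the names here are made up):

    # Hypothetical hook: a user-supplied function that maps a token id to its raw
    # bytes, consulted before the built-in byte_decoder / sp_model / gpt2 branches.
    from typing import Callable, Optional

    def build_byte_tokens(tokenizer, byte_tokens_fn: Optional[Callable[[object, int], bytes]] = None):
        if byte_tokens_fn is not None:
            return [byte_tokens_fn(tokenizer, i) for i in range(len(tokenizer))]
        raise NotImplementedError("fall through to the existing hasattr branches here")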

yonitjio · Jun 04 '24 14:06