Use the HuggingFace LLaMA tokenizer
The tokenizers crate by HuggingFace should give us a more correct tokenizer implementation than the one we're currently using.
It looks like a LLaMA implementation has already landed there (https://github.com/huggingface/transformers/pull/21955), and @Narsil shared an additional PR on the tokenizers crate (I'm not sure what it fixes, but I assume the changes are necessary?): https://github.com/huggingface/tokenizers/pull/1183
Seems like we have everything we need to use the new tokenizer. An important point remains, though: are we allowed to distribute the tokenizer file? Can it be considered completely independent from the weights?
Alright, I made a first attempt, but couldn't get it working. Here's what I tried:
- Pulled the https://github.com/huggingface/transformers/ repository.
- Installed torch with `pip install torch`.
- Ran the converter script to convert both the weights and the tokenizer to the HuggingFace format:

  ```sh
  python3 src/transformers/models/llama/convert_llama_weights_to_hf.py --input_dir "/data/Llama/LLaMA/" --model_size 7B --output_dir /data/Llama/LLaMA/7B
  ```

- Added the https://github.com/huggingface/tokenizers/ crate as a git dependency in our Cargo.toml, pointing to the latest `main` branch commit (a sketch of the entry follows the code below).
- Tried loading the tokenizer from the file, as suggested:
```rust
use tokenizers::tokenizer::{Result, Tokenizer};

let tokenizer = Tokenizer::from_file("/data/Llama/LLaMA/7B/tokenizer/tokenizer.model").unwrap();
```
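For reference, the git-dependency step amounts to an entry like this in Cargo.toml (a minimal sketch; cargo resolves the `tokenizers` package from the repository's workspace and tracks the default branch):

```toml
[dependencies]
# Track the latest commit on the default (main) branch.
tokenizers = { git = "https://github.com/huggingface/tokenizers" }
```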
Here, I got an "invalid UTF-8" error. Digging into the source, I figured out that this expects a JSON file, so I tried pointing it at `tokenizer_config.json`, but that didn't work either :thinking:

```
Error("expected `,` or `}`", line: 1, column: 13)
```

Digging further into the source, it doesn't look like the file I have is even correct. Perhaps I need to convert it in some other way?
Pinging @Narsil again, if you would be so kind as to give us a hand here :sweat_smile:
Apart from my initial exploration, I also realized that the tokenizers crate brings in a ton of dependencies, and requires OpenSSL to be installed in order to build.
I don't think all of this (especially OpenSSL) is needed just to get a tokenizer working, so we should look at this dependency more carefully. Maybe there's a way to extract just the bits we need?
Hey, you're trying to convert the model; there are other scripts for the tokenizer. I haven't finished it yet (it just requires more testing).
For dependencies, you can build with `no-default-features`. The crate does depend on esaxx-rs and onig, which are not entirely needed for this specific tokenizer, but the library covers a bit more. I'll share here once the file is done (and checked against).
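To illustrate, the trimmed dependency entry would look something like this (an untested sketch; exactly which functionality sits behind the default features may vary between releases):

```toml
[dependencies]
# Opt out of default features to shrink the dependency tree, then
# re-enable only the features this use case turns out to need.
tokenizers = { git = "https://github.com/huggingface/tokenizers", default-features = false }
```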
The tokenizer is ready here: https://huggingface.co/hf-internal-testing/tiny-random-llama/tree/main
But it requires `tokenizers@main`, which is not released yet. I'll try to do a release next week (there are still a few needed updates within transformers, and some additional checks, since the change is much bigger than anticipated).
`tokenizers` 0.13.3 is released and can be used.
The tokenizer is here https://huggingface.co/hf-internal-testing/llama-tokenizer (tokenizer.json).
```rust
use tokenizers::Tokenizer;

let tokenizer = Tokenizer::from_file("tokenizer.json").unwrap();
// `false`: don't add special tokens at this stage
let encoded = tokenizer.encode("This is a test", false).unwrap();
// `None` is the optional second sentence;
// `true` is whether to add special tokens
let encoded = tokenizer.post_process(encoded, None, true).unwrap();
```
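Continuing that snippet, the processed `Encoding` exposes the pieces typically needed downstream (`get_ids` and `get_tokens` are the accessors on `Encoding`):

```rust
// Token ids to feed the model, plus string tokens for debugging.
let ids: &[u32] = encoded.get_ids();
let tokens: &[String] = encoded.get_tokens();
println!("{:?}", ids);
println!("{:?}", tokens);
```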
https://docs.rs/tokenizers/0.13.3/tokenizers/tokenizer/struct.TokenizerImpl.html#method.post_process
Cheers !
Hi @Narsil! Thanks a lot :)
We are evaluating the best route to integrate this. I have a few questions, if you don't mind:
- We are considering a potential integration of BLOOM and RWKV in the future. Would it be possible to use this library to tokenize input for those models?
- Do you happen to know what the tokens 3 to 258 are used for? They seem to be used to represent raw byte data. Is the point of this to allow the model to represent non-UTF-8 sequences of characters? How does the library handle these tokens when decoding back to a string?
> We are considering a potential integration of BLOOM and RWKV in the future. Would it be possible to use this library to tokenize input for those models?
BLOOM is supported, with exactly the same code (just use the file from `bigscience/bloom`).
> Do you happen to know what the tokens 3 to 258 are used for? They seem to be used to represent raw byte data. Is the point of this to allow the model to represent non-UTF-8 sequences of characters? How does the library handle these tokens when decoding back to a string?
These are the "byte-fallback" tokens. When a character would otherwise produce an `UNK` token, the byte fallback will split the character(s) into raw bytes and use those tokens appropriately.
When decoding, it will attempt to interpret the bytes as UTF-8, using the unknown glyph � for each invalid byte:
https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/decoders/byte_fallback.rs#L47
This exactly mirrors what sentencepiece does.
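A small sketch of that round trip (assuming the llama tokenizer.json from the link above is available locally; the exact pieces printed depend on the vocabulary):

```rust
use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // Characters missing from the vocabulary are split into their
    // UTF-8 bytes, each encoded as a <0xNN> byte-fallback token.
    let encoded = tokenizer.encode("狐狸", false)?;
    println!("{:?}", encoded.get_tokens());

    // Decoding reassembles the bytes; any byte sequence that is not
    // valid UTF-8 is rendered as the replacement glyph �.
    let decoded = tokenizer.decode(encoded.get_ids().to_vec(), false)?;
    println!("{}", decoded);
    Ok(())
}
```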
For RWKV I have no idea what tokenizer they use. Do you have a link?
> ...and RWKV in the future.
The official RWKV project uses the Python version of tokenizers.
I'm also using it in my little RWKV inference experiment if an example of use would be helpful: https://github.com/KerfuffleV2/smolrsrwkv
You will need the .json file which defines the tokenizer RWKV models are set up to use. You can find it here: https://github.com/BlinkDL/ChatRWKV/blob/main/v2/20B_tokenizer.json
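So the same loading code from earlier in this thread should work for RWKV too; a sketch, assuming 20B_tokenizer.json has been downloaded locally:

```rust
use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    // The NeoX-style tokenizer file used by RWKV models.
    let tokenizer = Tokenizer::from_file("20B_tokenizer.json")?;
    let encoded = tokenizer.encode("Hello RWKV", false)?;
    println!("{:?}", encoded.get_ids());
    Ok(())
}
```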
It seems like the current tokenizer can't handle non-English text? For example, using

```
### Human: 请给我讲一个关于狐狸的故事。
```

("Please tell me a story about a fox.") as the prompt results in:

```
[2023-04-07T14:56:15Z ERROR llama_cli] Failed to tokenize initial prompt.
```
But llama.cpp works fine:

```
### Human: 请给我讲一个关于狐狸的故事。
### Assistant:
从前有一只非常机智的狐狸。她名叫小美,是整个森林中最受欢迎的狐狸之一。有一天,小美在追逐一只松鼠时,她发现了一件奇怪的东西。
这是一条光明的线条,从美麦上看起来是散发的,但是实际上跟随着小美的动作。小美快速找到了这个线条并试图拉响他们之间的联系。然而,在她的拼命中失去了一部分半透明绳子!
小美意识到现在只剩下几乎一线,但是她还是希望能够找到这条完整的线条。于是,她开始了旅程,去各个方向走动,细嫩地跟随着每一乎线之间的疑问线索。
小美花费不少时间才发现了那段完整的线条了。很好奇,这条线在哪里面与线走在各个方向之间是什么关系?小美就开始了她的冒险,她想要知道这条线和它所跟随的疑问与它具有何种联系。
在她一直追逐后,小美渐渐意识到了一个很特别的现象。每当她接近线走的一根线时,都会被拉动;然而,每当她与线走的附近时,他们之间是两根不相连的线段!
小美很困惑,为什么这条线和那个线之间要有这样一个特殊的联系?在她追逐着不断放弃之前的好奇心中,她开始意识到了线之间并存的其他奇妙现象。
举个例子:有一次,小美发现他们遇到了一只大雌狗。这只雌狗实在是线走的连续的线段!他们向前攀爬,向后攀爬,但是没有放开线条之间的联系,从而成为了连接着两个线段的中转点。
尽管雌狗已经有足够的体质，但是小美还是不会放过这种奇妙现象，最终也成了一只身旁的狐狸。而那条连环形的线段就可能让他们变得永远隔开，无法在互相之间创造联系。
```

(The reply is a story in Chinese about a clever fox named 小美 (Xiaomei) who chases a mysterious thread through the forest.)
Also, I'm really impressed; it seems like Vicuna 13B can actually speak Mandarin!
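If the switch to the HuggingFace tokenizer lands, byte fallback should make prompts like this encodable; a hedged sketch of such a check, assuming the llama tokenizer.json from earlier in the thread:

```rust
use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;
    // With byte fallback, encoding should succeed even for characters
    // that have no dedicated token in the vocabulary.
    let encoded = tokenizer.encode("### Human: 请给我讲一个关于狐狸的故事。", false)?;
    println!("{} tokens", encoded.get_ids().len());
    Ok(())
}
```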
I'll test that with my fix for #11, which I suspect is the same issue.
Yes, it looks like the same thing to me as well.
With #122, your sample prompt tokenizes correctly, but doesn't produce any output with Alpaca. I'll have to get my hands on a compatible Vicuna 13B at some point...