
Improve support for special tokens

Igoorx opened this pull request 2 years ago

Fixes #1812 Fixes #1501

Hello! This is my first attempt to contribute to this project, so I apologize in advance for any mistakes. This PR adds basic support for special tokens and improves the support for added tokens. The special tokens come from the file added_tokens.json and the file special_tokens_map.json or the file tokenizer_config.json ~~(I have no idea if it's safe to rely on only one, so I added tokenizer_config.json as a fallback).~~ EDIT: The loading of the JSON files now seems to follow the huggingface implementation more closely.

The most important points of this PR are:

  • ~~The GGML format was changed due to the requirement of a way to know which tokens are the special tokens.~~ EDIT: This isn't necessary anymore.
  • The tokenizer now uses a trie algorithm to efficiently split the prompt on the special tokens; this was necessary because the BPE tokenizer isn't able to tokenize the special tokens by itself. Please note that this algorithm was ported from the huggingface/transformers repository, so I wonder if this could cause license issues?
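
To illustrate the idea (this is not the PR's actual code, which lives in llama-util.h and is ported from huggingface/transformers), a greedy longest-match trie split over the special-token strings could look roughly like the following sketch; all names here are made up for illustration:

```cpp
// Sketch: split a prompt on special-token strings with a trie so each special
// token can be looked up in the vocab directly instead of going through BPE.
#include <map>
#include <string>
#include <vector>

struct trie_node {
    std::map<char, trie_node> children;
    bool is_terminal = false;
};

struct special_token_trie {
    trie_node root;

    void add(const std::string & tok) {
        trie_node * node = &root;
        for (char c : tok) {
            node = &node->children[c];
        }
        node->is_terminal = true;
    }

    // Split `text` into plain-text pieces and special-token pieces,
    // always taking the longest special token that matches at a position.
    std::vector<std::string> split(const std::string & text) const {
        std::vector<std::string> parts;
        size_t start = 0, pos = 0;
        while (pos < text.size()) {
            const trie_node * node = &root;
            size_t end = pos, match_end = 0;
            while (end < text.size()) {
                auto it = node->children.find(text[end]);
                if (it == node->children.end()) break;
                node = &it->second;
                ++end;
                if (node->is_terminal) match_end = end;
            }
            if (match_end > 0) {
                if (pos > start) parts.push_back(text.substr(start, pos - start));
                parts.push_back(text.substr(pos, match_end - pos)); // special token
                start = pos = match_end;
            } else {
                ++pos;
            }
        }
        if (start < text.size()) parts.push_back(text.substr(start));
        return parts;
    }
};
```

With `</s>` added to such a trie, splitting `" One Two</s>Three</s> Four "` would separate the plain pieces from the `</s>` occurrences; the special pieces map straight to their token ids while the rest goes through the normal tokenizer, which matches the --verbose-prompt output below.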

Using this PR, this is the output of --verbose-prompt:

main: prompt: ' One Two</s>Three</s> Four '
main: number of tokens in prompt = 8
     1 -> '<s>'
  3118 -> ' One'
  7803 -> ' Two'
     2 -> '</s>'
 28575 -> 'Three'
     2 -> '</s>'
 12458 -> ' Four'
 29871 -> ' '

when the model is converted to ggml with this special_tokens_map.json:

{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<unk>",
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
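
As a side note, entries in special_tokens_map.json can be either bare strings (like "pad_token" above) or objects with a "content" field, so a converter has to handle both forms. A minimal sketch of that check, assuming a C++ JSON library such as nlohmann/json is available (llama.cpp's actual conversion tooling is Python, so this is purely illustrative and the helper name is made up):

```cpp
// Sketch: get the text of a special-token entry that may be either a bare
// string ("pad_token": "<unk>") or an object with a "content" field.
#include <fstream>
#include <optional>
#include <string>
#include <nlohmann/json.hpp>

static std::optional<std::string> special_token_content(const nlohmann::json & j,
                                                        const std::string & key) {
    if (!j.contains(key)) {
        return std::nullopt;
    }
    const auto & entry = j.at(key);
    if (entry.is_string()) {
        return entry.get<std::string>();               // e.g. "pad_token": "<unk>"
    }
    if (entry.is_object() && entry.contains("content")) {
        return entry.at("content").get<std::string>(); // e.g. "eos_token": {"content": "</s>", ...}
    }
    return std::nullopt;
}

int main() {
    std::ifstream f("special_tokens_map.json");
    nlohmann::json j = nlohmann::json::parse(f);
    auto eos = special_token_content(j, "eos_token"); // "</s>" for the file above
    auto pad = special_token_content(j, "pad_token"); // "<unk>"
    return eos && pad ? 0 : 1;
}
```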

Igoorx avatar Jun 19 '23 00:06 Igoorx

A breaking change to the GGML format might be a tough sell (but don't take my personal opinion as speaking for the project in any way). You might consider adding a command-line option and/or API addition that allows reading the special tokens from a separate file or list.

so I wonder if this could cause license issues?

llama.cpp is under MIT and transformers seems to be Apache 2.0. I'm not qualified to say what the issue is or how to fix it, but the projects do have different licenses. I don't know if there's any kind of policy for dealing with that situation already in place. Perhaps someone else can give you a better answer, but until that part is resolved I'd guess you shouldn't expect your PR to get merged (again, not speaking with any kind of authority).

KerfuffleV2 avatar Jun 19 '23 21:06 KerfuffleV2

A breaking change to the GGML format might be a tough sell (but don't take my personal opinion as speaking for the project in any way). You might consider adding a command-line option and/or API addition that allows reading the special tokens from a separate file or list.

On the bright side, the old format is still supported... I agree with you, and it's also something I would want to avoid, but after looking at the code it seems that everything related to the vocab is embedded inside the ggml file (e.g. the file added_tokens.json and the whole vocab itself), so I thought the only sensible choice was to include the list of special tokens too. I don't know what the maintainers of the project would prefer, though, so I'm open to making any changes.

llama.cpp is under MIT and transformers seems to be Apache 2.0. I'm not qualified to say what the issue is or how to fix it, but the projects do have different licenses. I don't know if there's any kind of policy for dealing with that situation already in place.

Yes, this is something that concerns me a bit. I would appreciate feedback on whether I should be concerned about the license. After all, I'm not simply copy-pasting the code. Perhaps the comment I added with the link to the original implementation would be sufficient? Even if the license should still be respected to the fullest, I am aware that Apache is compatible with MIT, so I believe there should be no issue with including that code in the repository. However, I also need feedback on this matter. This seems to be the first time this issue has arisen here, so I'm not sure whether the maintainers would want to add the Hugging Face license specifically for this piece of code. If they do, I'm not sure where it should go; perhaps just appended to the LICENSE file?

Perhaps someone else can give you a better answer, but until that part is resolved I'd guess you shouldn't expect your PR to get merged (again, not speaking with any kind of authority).

You're absolutely right, no doubts about that; that's why I brought it up in the PR message. This is quite important for an open-source project, so it needs to be sorted out before the PR is merged.

Igoorx avatar Jun 19 '23 22:06 Igoorx

AFAIK, there is a discussion on a new format here: https://github.com/ggerganov/ggml/issues/220

You may want to chime in.

I don't believe I have anything to contribute to the discussion... but upon a closer look, it appears there is also a discussion regarding the tokenizer in that issue 🤔 I'm uncertain whether there is a possibility of this PR being merged or if the maintainers would prefer to wait for GGUF. I suppose I will modify the approach of this PR and alter the GGML format in such a way that the new model format remains compatible with older versions. That way, if there are still doubts about when GGUF will be merged, this pull request could serve as a temporary solution to the special tokens issue at least until then.

Igoorx avatar Jun 20 '23 22:06 Igoorx

Hey @Igoorx what's the status of this PR? I'm really interested in this work.

I fine tune using LoRA and add a few special tokens, but then these aren't tokenized correctly when running inference on llama.cpp. I'm going to try your PR and see if it helps things.

grantbey avatar Aug 07 '23 10:08 grantbey

@grantbey It's finished, but since the maintainers showed no interest whatsoever in merging it, I didn't resolve the merge conflicts. If you can't do that by yourself, you should just wait for GGUF, it's right around the corner: https://github.com/ggerganov/llama.cpp/pull/2398

Igoorx avatar Aug 07 '23 13:08 Igoorx

Thanks. I’ve started taking a stab at the merge conflicts but I imagine I’ll get a few things wrong and it’ll end up consuming a lot of my time 😅

I read about GGUF but I’ve got some deadlines coming up and I’d like a working solution sooner rather than later, and it’s not clear what kind of timeline to expect GGUF on.

grantbey avatar Aug 07 '23 13:08 grantbey

@grantbey I rebased the PR to the last master commit :+1:

Igoorx avatar Aug 07 '23 16:08 Igoorx

Ok wow @Igoorx you're amazing. Thank you so much!

grantbey avatar Aug 07 '23 16:08 grantbey

@Igoorx : @klosax just made me aware that we are working on weaknesses of the tokenizer(s) at the same time. I'd greatly appreciate any cooperation on this. If I were to incorporate your changes, my first question would be how to test them?

goerch avatar Aug 07 '23 19:08 goerch

@goerch Here is a simple test: https://github.com/ggerganov/llama.cpp/pull/1931/commits/6f7dabab441566078446ef868e573cd309fe62be

Igoorx avatar Aug 07 '23 20:08 Igoorx

@goerch Here is a simple test: 6f7daba

Nice, thank you! Then my plan would be to first try to integrate your changes into #2315 and afterwards migrate the PR over to gguf. Would that be OK with you?

goerch avatar Aug 07 '23 20:08 goerch

@goerch Yeah, that's fine. You just need to be aware of this: https://github.com/ggerganov/llama.cpp/blob/6f7dabab441566078446ef868e573cd309fe62be/llama-util.h#L554-L555 The trie algorithm used in the PR is a port from the huggingface repository, as noted in the comment, so maybe something needs to be done about it. I'm not sure if that comment is enough or if it would be necessary to add the HF license somewhere.

Igoorx avatar Aug 07 '23 20:08 Igoorx

The gguf gpt2 tokenizer also has a Trie implementation. That tokenizer is under the MIT license. Maybe it could be reused for the llama tokenizer.

klosax avatar Aug 07 '23 21:08 klosax

@goerch Yeah, that's fine. You just have to be aware about: ...

IANAL, but I'm equally concerned about compatibility with the sentencepiece license (Apache-2.0). They use a trie too, which might be the basis of the hf implementation? When developing #2315 I already experimented with a simple clean-room implementation of a trie; I'll try to understand the differences.

@ggerganov and others: any opinions on this?

goerch avatar Aug 07 '23 21:08 goerch

The author of the gpt2 tokenizer gave permission to use it and stated that it is under the MIT license here: https://github.com/ggerganov/llama.cpp/pull/2398#issuecomment-1667009979

klosax avatar Aug 07 '23 21:08 klosax

The important part is the split method: https://github.com/ggerganov/llama.cpp/blob/6f7dabab441566078446ef868e573cd309fe62be/llama-util.h#L575-L577 If using the ported version isn't an option, it would be necessary to reimplement it using the other trie algorithm.

IANAL, but I'm equally concerned about compatibility with the sentencepiece license (Apache-2.0).

IANAL either, but I think Apache-2.0 and MIT should be compatible with each other: https://law.stackexchange.com/a/6732

Igoorx avatar Aug 07 '23 21:08 Igoorx

It looks like the MIT and Apache licenses are compatible, but a copy of the Apache license and a Notice file must be included: https://softwareengineering.stackexchange.com/questions/51987/how-to-include-an-apache-library-with-my-opensource-code#52223

klosax avatar Aug 07 '23 22:08 klosax

I'd recommend either implementing a trie from scratch or using a linear search algorithm - we are not tokenizing billions of tokens, so I'm not sure what we gain from using a trie.

ggerganov avatar Aug 08 '23 10:08 ggerganov

I'd recommend either implementing a trie from scratch or using a linear search algorithm - we are not tokenizing billions of tokens, so I'm not sure what we gain from using a trie.

If you say so, then the trie probably really is a premature optimization in this case... I changed the code to use linear search.
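
For reference, a linear-search version of the split stays short. Here is a rough sketch (again, not the PR's actual code, and the helper name is made up):

```cpp
// Sketch: split text on special tokens by checking every special token at each
// position. With only a handful of special tokens and prompt-sized inputs this
// is plenty fast, so a trie is not really needed.
#include <string>
#include <vector>

static std::vector<std::string> split_on_special(const std::string & text,
                                                 const std::vector<std::string> & special_tokens) {
    std::vector<std::string> parts;
    size_t start = 0, pos = 0;
    while (pos < text.size()) {
        size_t match_len = 0;
        for (const auto & tok : special_tokens) {
            // keep the longest special token that matches at this position
            if (tok.size() > match_len && text.compare(pos, tok.size(), tok) == 0) {
                match_len = tok.size();
            }
        }
        if (match_len > 0) {
            if (pos > start) parts.push_back(text.substr(start, pos - start));
            parts.push_back(text.substr(pos, match_len)); // special token
            start = pos = pos + match_len;
        } else {
            ++pos;
        }
    }
    if (start < text.size()) parts.push_back(text.substr(start));
    return parts;
}
```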

Igoorx avatar Aug 08 '23 15:08 Igoorx

@Igoorx : I took a look at your PR, but I believe we first have to sort out the open problems with #2549. Sorry for the delay.

goerch avatar Aug 08 '23 20:08 goerch

@Igoorx : #2549 is done; I'm now waiting for review of https://github.com/ggerganov/llama.cpp/pull/3252 and am already downloading Vicuna-7B-v1.1.

goerch avatar Sep 19 '23 12:09 goerch

Currently discussing this at https://github.com/ggerganov/llama.cpp/issues/2820. Maybe close this one and participate over there?

goerch avatar Sep 20 '23 21:09 goerch

Looks like this PR was superseded by https://github.com/ggerganov/llama.cpp/pull/3538; from what I could see, it looks great. Thanks for your attention @goerch! I don't think I have anything more to contribute.

Igoorx avatar Oct 12 '23 00:10 Igoorx