llama : add T5 (encoder-decoder) support
Still not familiar with the details, but it seems it would be useful to support this architecture in llama.cpp. First, we need to decide on the API and see what changes would be necessary.
See discussion here: https://github.com/ggerganov/llama.cpp/issues/247
@ggerganov Does this mean llama.cpp could support something like the new GritLM model which can handle both text representations and text generation? I tried the embedding sample with gritlm but the resulting embeddings don't look right.
Some references: https://github.com/ContextualAI/gritlm/blob/92025b16534712b31b3c4aaaf069350e222bd5f8/gritlm/gritlm.py#L93 https://huggingface.co/GritLM/GritLM-7B
The issue is about a different architecture (encoder + decoder). GritLM looks like a decoder-only Mistral fine-tune, so it should already work. If you think the results are not OK, you can open an issue with steps to reproduce.
I am looking forward to this. How much work would be needed to implement it?
@dranger003 Probably that's because GritLM uses 2 prompt templates, one is used only for text generation and one only for embedding. Can you try embedding with the template specified by the author?
Feel free to open a dedicated issue to discuss in detail.
@ngxson thanks, I used the proper template. I opened an issue with a sample program.
T5 support would be truly awesome, expanding opportunities for numerous enterprise use cases.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Any update on this issue?
Any news on this one? :)
I'm also waiting for T5 (i.e. encoder-decoder) support in llama.cpp. Why? Because I could not find any embeddable (C++, C or Rust) T5 implementation with KV cache, out-of-the-box quantization and grammar support. I wish I could help with the development, but this is currently out of my league. 🥲
It would also be nice for fast image-generation prompt embeddings, as in PixArt and the upcoming Stable Diffusion 3 coming on the 12th of June (they use T5 as a text encoder).
I have T5 working in llama.cpp, but the code needs to be cleaned up and it still uses an additional header file (darts.h - Double-ARray Trie System, MIT license) needed by the unigram tokenizer implementation. The git diff is 2.5k lines long ;_;
What functionality does darts.h provide? If it is just for fast string searches, we can replace it with some basic naive implementation for a start.
@ggerganov It's a C++ header-only trie implementation. Currently it's used in three places:
1. Finding user-defined tokens during input normalization, so they won't be normalized
2. Normalization of input before tokenization
3. Finding tokens during tokenization

While 1 and 3 could be replaced with a naive string search, for 2 the trie is created based on precompiled_charsmap from the SentencePiece tokenizer model. It's basically a binary blob containing pre-tokenization normalization rules. I didn't examine it in detail, so I'm not sure yet whether the normalization rules can be applied without using the trie.
Things are going better than expected - I managed to get rid of the darts.h dependency and implement the necessary functionality. My naive trie implementation is 2x slower than darts.h and more memory-hungry, but I guess we can build from that. I still have some code to rewrite, but it shouldn't take as long as I initially thought.
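For illustration, here is a minimal self-contained sketch of the kind of naive longest-match trie that can stand in for darts.h during tokenization. This is just the general idea, not the code from the branch:

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <utility>

// Minimal prefix trie for longest-match token lookup (illustration only).
struct naive_trie {
    std::map<char, naive_trie> children;
    int token_id = -1; // -1 means no token ends at this node

    void insert(const std::string & text, int id) {
        naive_trie * node = this;
        for (char c : text) {
            node = &node->children[c];
        }
        node->token_id = id;
    }

    // returns {matched length in bytes, token id} of the longest token that is
    // a prefix of text[pos..], or {0, -1} if nothing matches
    std::pair<size_t, int> longest_match(const std::string & text, size_t pos) const {
        const naive_trie * node = this;
        std::pair<size_t, int> best = {0, -1};
        for (size_t i = pos; i < text.size(); ++i) {
            auto it = node->children.find(text[i]);
            if (it == node->children.end()) {
                break;
            }
            node = &it->second;
            if (node->token_id != -1) {
                best = {i - pos + 1, node->token_id};
            }
        }
        return best;
    }
};

int main() {
    naive_trie trie;
    trie.insert("▁the", 3);
    trie.insert("▁th",  4);
    trie.insert("▁",    5);

    auto [len, id] = trie.longest_match("▁the house", 0);
    printf("matched %zu bytes -> token %d\n", len, id); // the longest match wins
}
```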
I added a branch with my T5 implementation: https://github.com/fairydreaming/llama.cpp/tree/t5 This is still a work in progress. For now I modified main.cpp to include a llama_encode() call and pass the computed encoder embeddings to llama_decode(), so you can test it with the llama-cli command if you want:
./llama-cli -m models/t5-small.gguf -p 'translate English to German: The house is wonderful.'
should result in:
...
system_info: n_threads = 32 / 64 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 0
Das Haus ist wunderbar. [end of text]
llama_print_timings: load time = 19.07 ms
llama_print_timings: sample time = 0.10 ms / 6 runs ( 0.02 ms per token, 63157.89 tokens per second)
llama_print_timings: prompt eval time = 4.19 ms / 11 tokens ( 0.38 ms per token, 2625.93 tokens per second)
llama_print_timings: eval time = 12.62 ms / 6 runs ( 2.10 ms per token, 475.62 tokens per second)
llama_print_timings: total time = 32.13 ms / 17 tokens
Log end
I tried T5-small, T5-base and T5-large, and they all seem to work OK. I also compared layer outputs of T5-small with the transformers implementation, and they look the same.
Edit: forgot to mention that tests/test-c.c currently doesn't compile in the branch since I added some default argument values in headers. This is normal. ;)
Very cool!
I'm wondering about the extended llama_batch with the n_enc_output and enc_output members. Is there some way in which the enc_output is never exposed to the user and remains internal to the llama_context? I'm looking for ways to simplify the interface. If the encoded embeddings remain within the context, then we don't have to explicitly pass them later to llama_decode.
@ggerganov Good advice, I did that and it definitely simplified things; I also added an is_encoding flag in the context to avoid passing additional parameters. I still need to research how batches work to properly support them.
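Purely for illustration (the actual field names in the branch may differ), keeping the encoder output inside the context amounts to something like:

```cpp
#include <vector>

// Hypothetical sketch: the encoder output lives inside the internal context
// object, so it never has to appear in the public llama_batch.
struct llama_context_state_sketch {
    std::vector<float> embd_enc;            // written by llama_encode(), read by llama_decode()
    bool               is_encoding = false; // set while the encoder graph is being built and run
};
```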
@ggerganov Do you think it's better to create a separate example for encoder-decoder models or to modify the llama-cli command to include a llama_encode() call like I did in my branch? In the second case I think we would need some additional API calls:
- The first one would distinguish encoder-decoder models from other models, so we could conditionally call llama_encode() for them. It could be, for example, something like llama_model_arch() returning an enum value (e.g. LLAMA_ARCH_TYPE_(ENCODER_ONLY|DECODER_ONLY|ENCODER_DECODER) or LLAMA_ARCH_TYPE_(AUTOENCODING|AUTOREGRESSIVE|SEQ2SEQ)).
- The second one would return the decoder start token id to prepare input for llama_decode(), for example llama_token_decoder_start().

Any better ideas?
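For illustration, a rough sketch of what those two additions could look like in llama.h; the names and signatures are just the ones floated above, nothing that exists in the library yet:

```cpp
// Hypothetical sketch of the two proposed additions; only the names suggested
// in this comment, nothing final.
#include "llama.h"   // for llama_model and llama_token

enum llama_arch_type {
    LLAMA_ARCH_TYPE_ENCODER_ONLY,
    LLAMA_ARCH_TYPE_DECODER_ONLY,
    LLAMA_ARCH_TYPE_ENCODER_DECODER,
};

// 1. lets llama-cli decide whether llama_encode() must be called first
enum llama_arch_type llama_model_arch(const struct llama_model * model);

// 2. returns the token that starts the decoder sequence (PAD in the case of T5)
llama_token llama_token_decoder_start(const struct llama_model * model);
```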
It seems ok to merge into the existing llama-cli example - we can revisit later.
- Maybe bool llama_model_has_encoder() seems simpler?
- Not sure about this - I don't see how the start token is used in your branch.

Btw, you may want to open a draft PR so we can discuss the changes more easily.
> It seems ok to merge into the existing llama-cli example - we can revisit later.
> Maybe bool llama_model_has_encoder() seems simpler?

OK

> Not sure about this - I don't see how the start token is used in your branch

https://github.com/fairydreaming/llama.cpp/blob/205fee33ca7b893f10f14c6350ae620f0724f640/examples/main/main.cpp#L504-L513

After the llama_encode() call I clear embd_inp and append llama_token_pad(model). This token is used in T5 as the first token of the sequence autoregressively generated by the decoder (the HF transformers implementation calls it the decoder start token). It may be different from PAD in other models, so I think we need some general way to get it.
> Btw, you may want to open a draft PR so we can discuss the changes more easily.

Yeah, will do that soon.
> This token is used in T5 as the first token of the sequence autoregressively generated by the decoder (the HF transformers implementation calls it the decoder start token). It may be different from PAD in other models, so I think we need some general way to get it.

Got it. Maybe llama_token_bod() (i.e. "beginning of decoding")?

Btw, to me it would have made more sense if that token were the standard BOS token and the token used at the start of the encoded text were something like a "beginning of context/task" token.
I decided to add T5 support in a series of smaller PRs instead of one giant PR to facilitate code review and merging. The first PR is #8055, which adds model conversion support.
First PR is now merged, I created another one adding the Unigram tokenizer: #8089
Second PR is merged, the third one is ready: #8141
After this it will be possible to use T5 models with llama-cli.
~~There are still some problems with the CUDA backend, though.~~
Edit: the problems were caused by the insufficient range of the f16 type (its maximum finite value is only 65504), which resulted in -inf and later NaN values, so avoid f16 with the CUDA backend. With f32 there are no problems.
This is really great work, and I'm excited to watch all the progress happening on this model architecture. I have one (hopefully) small request: support the ByT5 tokenizer for that family of T5 models. It has some unique use cases, such as multilingual tasks and text denoising, where it is really useful.
Fortunately it is a very simple tokenizer that mostly just maps UTF-8 bytes directly to integers, plus some special tokens. https://github.com/huggingface/transformers/blob/main/src/transformers/models/byt5/tokenization_byt5.py
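For reference, a minimal sketch of that idea, assuming the usual ByT5 layout of a few leading special tokens followed by the 256 byte values (the offsets are taken from the HF tokenizer linked above; treat them as illustrative):

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Minimal sketch of ByT5-style byte tokenization: a handful of special tokens
// (pad=0, eos=1, unk=2 in the HF tokenizer linked above) followed by the raw
// UTF-8 byte values shifted by that offset. Illustration only.
static const int BYT5_OFFSET = 3; // number of leading special tokens
static const int BYT5_EOS    = 1;

std::vector<int> byt5_tokenize(const std::string & text) {
    std::vector<int> ids;
    ids.reserve(text.size() + 1);
    for (unsigned char byte : text) {
        ids.push_back((int) byte + BYT5_OFFSET); // byte value -> token id
    }
    ids.push_back(BYT5_EOS);
    return ids;
}

std::string byt5_detokenize(const std::vector<int> & ids) {
    std::string text;
    for (int id : ids) {
        if (id >= BYT5_OFFSET && id < BYT5_OFFSET + 256) {
            text += (char) (id - BYT5_OFFSET); // skip special/sentinel tokens
        }
    }
    return text;
}

int main() {
    auto ids = byt5_tokenize("Übersetzen"); // multi-byte UTF-8 is handled byte by byte
    for (int id : ids) printf("%d ", id);
    printf("\n%s\n", byt5_detokenize(ids).c_str());
}
```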
The third (and final) PR is now merged. TODO some day: add support for encoder-decoder models in llama-server.
@abetlen @fairydreaming
Hi,
Does this code land in llama.dll? I'm asking because I use llama-cpp-python, which uses llama.dll.
Great job btw, thanks
> Does this code land in llama.dll? I'm asking because I use llama-cpp-python, which uses llama.dll.
@Sadeghi85 I suppose the new code is there, but to use encoder-decoder models like T5 you have to use the new API functions: llama_model_has_encoder(), llama_encode(), llama_model_decoder_start_token(). So I think you have to wait until the llama-cpp-python author (@abetlen) adds support for them.
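For completeness, the calling pattern those functions enable looks roughly like this. It is a simplified sketch of what llama-cli does after the PRs above; batch setup and sampling are elided, and exact signatures can differ between llama.cpp versions:

```cpp
// Simplified sketch of the encoder-decoder flow; ctx, model, prompt_batch and
// embd_inp stand in for the variables set up earlier in examples/main/main.cpp.
if (llama_model_has_encoder(model)) {
    // run the encoder once over the tokenized prompt; its output stays inside
    // the llama_context and is picked up by later llama_decode() calls
    if (llama_encode(ctx, prompt_batch) != 0) {
        fprintf(stderr, "llama_encode() failed\n");
    }

    // the decoder does not start from the prompt but from the model's
    // decoder start token (PAD in the case of T5)
    llama_token decoder_start = llama_model_decoder_start_token(model);
    if (decoder_start == -1) {
        decoder_start = llama_token_bos(model);
    }
    embd_inp.clear();
    embd_inp.push_back(decoder_start);
}
// the usual llama_decode() + sampling loop then generates the output tokens
```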