
llama : add T5 (encoder-decoder) support

Open ggerganov opened this issue 11 months ago • 7 comments

Still not familiar with the details, but it seems it would be useful to support this architecture in llama.cpp. First, we need to decide on the API and see what changes would be necessary.

See discussion here: https://github.com/ggerganov/llama.cpp/issues/247

ggerganov avatar Feb 28 '24 11:02 ggerganov

@ggerganov Does this mean llama.cpp could support something like the new GritLM model which can handle both text representations and text generation? I tried the embedding sample with gritlm but the resulting embeddings don't look right.

Some references: https://github.com/ContextualAI/gritlm/blob/92025b16534712b31b3c4aaaf069350e222bd5f8/gritlm/gritlm.py#L93 https://huggingface.co/GritLM/GritLM-7B

dranger003 avatar Feb 28 '24 15:02 dranger003

This issue is about a different architecture (encoder + decoder). GritLM looks like a decoder-only Mistral fine-tune, so it should already work. If you think the results are not OK, you can open an issue with steps to reproduce.

ggerganov avatar Feb 28 '24 15:02 ggerganov

I am looking forward to this. How much work would be needed to implement this?

sorasoras avatar Feb 28 '24 19:02 sorasoras

@dranger003 That's probably because GritLM uses two prompt templates: one only for text generation and one only for embedding. Can you try embedding with the template specified by the author?

Feel free to open a dedicated issue to discuss in details.

ngxson avatar Feb 28 '24 21:02 ngxson

@ngxson thanks, I used the proper template. I opened an issue with a sample program.

dranger003 avatar Feb 28 '24 22:02 dranger003

T5 support would be truly awesome, expanding opportunities for numerous enterprise use cases.

Mihaiii avatar Mar 01 '24 22:03 Mihaiii

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Apr 17 '24 01:04 github-actions[bot]

Any update on this issue?

nooobkevin avatar Apr 30 '24 07:04 nooobkevin

Any news on this one? :)

Mihaiii avatar May 20 '24 07:05 Mihaiii

I'm also waiting for T5 (i.e. encoder-decoder) support in llama.cpp. Why? Because I could not find any embeddable (C++, C or Rust) T5 implementation with KV cache, out-of-the-box quantization and grammar support. I wish I could help with the development, but this is currently out of my league. 🥲

vladfaust avatar May 20 '24 10:05 vladfaust

It would also be nice for fast image-generation embedding encoding, as in PixArt and the upcoming Stable Diffusion 3 on June 12th (they use T5).

kabachuha avatar Jun 03 '24 15:06 kabachuha

I have T5 working in llama.cpp, but the code needs to be cleaned up and it still uses an additional header file (darts.h, the Double-ARray Trie System, MIT license) needed by the Unigram tokenizer implementation. The git diff is 2.5k lines long ;_;

fairydreaming avatar Jun 09 '24 20:06 fairydreaming

What functionality does darts.h provide? If it is just for fast string searches, we can replace it with some basic naive implementation for a start.

ggerganov avatar Jun 10 '24 06:06 ggerganov

What functionality does darts.h provide? If it is just for fast string searches, we can replace it with some basic naive implementation for a start.

@ggerganov It's a C++ header-only trie implementation. Currently it's used in three places:

  1. Finding user-defined tokens during input normalization, so they won't be normalized
  2. Normalization of input before tokenization
  3. Finding tokens during tokenization

While 1 and 3 could be replaced with a naive string search, for 2 the trie is created from the precompiled_charsmap in the SentencePiece tokenizer model. It's basically a binary blob containing pre-tokenization normalization rules. Some information about it is here. I didn't examine it in detail, so I'm not sure yet whether the normalization rules can be applied without using the trie.

fairydreaming avatar Jun 10 '24 07:06 fairydreaming

Things are going better than expected: I managed to get rid of the darts.h dependency and implement the necessary functionality. My naive trie implementation is 2x slower than darts.h and more memory-hungry, but I guess we can build from that. I still have some code to rewrite, but it shouldn't take as long as I initially thought.
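
For illustration, a minimal sketch of what such a naive trie replacement could look like; the names and structure here are hypothetical, not the actual code in the branch:

    // hypothetical naive trie for longest-prefix token matching,
    // i.e. the kind of structure that could stand in for darts.h
    #include <map>
    #include <string>
    #include <utility>

    struct naive_trie {
        std::map<char, naive_trie> children;
        int token_id = -1; // -1 means no token ends at this node

        void insert(const std::string & text, int id, size_t pos = 0) {
            if (pos == text.size()) { token_id = id; return; }
            children[text[pos]].insert(text, id, pos + 1);
        }

        // longest token that is a prefix of text starting at pos: (token_id, length)
        std::pair<int, size_t> longest_match(const std::string & text, size_t pos) const {
            std::pair<int, size_t> best = { -1, 0 };
            const naive_trie * node = this;
            for (size_t i = pos; i < text.size(); ++i) {
                auto it = node->children.find(text[i]);
                if (it == node->children.end()) break;
                node = &it->second;
                if (node->token_id != -1) best = { node->token_id, i - pos + 1 };
            }
            return best;
        }
    };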

fairydreaming avatar Jun 11 '24 19:06 fairydreaming

I added a branch with my T5 implementation: https://github.com/fairydreaming/llama.cpp/tree/t5 (still a work in progress). For now I modified main.cpp to include a llama_encode() call and pass the computed encoder embeddings to llama_decode(), so you can test it with the llama-cli command if you want:

./llama-cli -m models/t5-small.gguf -p 'translate English to German: The house is wonderful.'

which should result in:

...
system_info: n_threads = 32 / 64 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 0


 Das Haus ist wunderbar. [end of text]

llama_print_timings:        load time =      19.07 ms
llama_print_timings:      sample time =       0.10 ms /     6 runs   (    0.02 ms per token, 63157.89 tokens per second)
llama_print_timings: prompt eval time =       4.19 ms /    11 tokens (    0.38 ms per token,  2625.93 tokens per second)
llama_print_timings:        eval time =      12.62 ms /     6 runs   (    2.10 ms per token,   475.62 tokens per second)
llama_print_timings:       total time =      32.13 ms /    17 tokens
Log end

I tried T5-small, T5-base and T5-large; they all seem to work OK. I also compared the layer outputs of T5-small with the transformers implementation, and they look the same.

Edit: forgot to mention that tests/test-c.c currently doesn't compile in the branch since I added some default argument values in headers. This is normal. ;)

fairydreaming avatar Jun 13 '24 20:06 fairydreaming

Very cool!

I'm wondering about the extended llama_batch with the n_enc_output and enc_output members. Is there some way in which enc_output is never presented to the user and remains internal to the llama_context? I'm looking for ways to simplify the interface. If the encoded embeddings remain within the context, then we don't have to explicitly pass them later to llama_decode.

ggerganov avatar Jun 14 '24 08:06 ggerganov

@ggerganov Good advice. I did that and it definitely simplified things; I also added an is_encoding flag in the context to avoid passing additional parameters. I still need to research how batches work to support them properly.

fairydreaming avatar Jun 14 '24 11:06 fairydreaming

@ggerganov Do you think it's better to create a separate example for encoder-decoder models or to modify the llama-cli command to include a llama_encode() call like I did in my branch? In the second case I think we would need some additional API calls:

  1. The first one to distinguish encoder-decoder models from other models, so we could conditionally call llama_encode() for them. It could be, for example, something like llama_model_arch() returning an enum value (e.g. LLAMA_ARCH_TYPE_(ENCODER_ONLY|DECODER_ONLY|ENCODER_DECODER) or LLAMA_ARCH_TYPE_(AUTOENCODING|AUTOREGRESSIVE|SEQ2SEQ)).
  2. The second one to get the decoder start token id to prepare input for llama_decode(), for example llama_token_decoder_start().

Any better ideas?

fairydreaming avatar Jun 17 '24 14:06 fairydreaming

It seems ok to merge into the existing llama-cli example - we can revisit later.

  1. Maybe bool llama_model_has_encoder() seems simpler?
  2. Not sure about this - I don't see how the start token is used in your branch

Btw, you may want to open a draft PR so we can discuss the changes more easily

ggerganov avatar Jun 17 '24 16:06 ggerganov

It seems ok to merge into the existing llama-cli example - we can revisit later.

  1. Maybe bool llama_model_has_encoder() seems simpler?

OK

  2. Not sure about this - I don't see how the start token is used in your branch

https://github.com/fairydreaming/llama.cpp/blob/205fee33ca7b893f10f14c6350ae620f0724f640/examples/main/main.cpp#L504-L513

After the llama_encode() call I clear embd_inp and append llama_token_pad(model). This token is used in T5 as the first token of the sequence autoregressively generated by the decoder (they call it the decoder start token in the HF transformers implementation). It may be different from PAD in other models, so I think we need some general way to get it.
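
Roughly, the flow described above looks like this (a simplified sketch, not the exact code in the branch; the batch construction is reduced to a single llama_batch_get_one() call and error handling is omitted):

    // encode the prompt, then seed the decoder with the pad token,
    // which T5 uses as its decoder start token
    llama_encode(ctx, llama_batch_get_one(embd_inp.data(), (int32_t) embd_inp.size(), 0, 0));
    embd_inp.clear();
    embd_inp.push_back(llama_token_pad(model)); // first token consumed by the decoder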

Btw, you may want to open a draft PR so we can discuss the changes more easily

Yeah, will do that soon.

fairydreaming avatar Jun 17 '24 19:06 fairydreaming

This token is used in T5 as the first token of the sequence autoregressively generated by the decoder (they call it the decoder start token in the HF transformers implementation). It may be different from PAD in other models, so I think we need some general way to get it.

Got it. Maybe llama_token_bod() (i.e. "beginning of decoding")?

Btw, to me it would have made more sense if that token were the standard BOS token, and the token used at the start of the encoded text were something like a "beginning of context/task" token

ggerganov avatar Jun 18 '24 08:06 ggerganov

I decided to add T5 support in a series of smaller PRs instead of one giant PR to facilitate code review and merging. The first PR is #8055; it adds model conversion support.

fairydreaming avatar Jun 22 '24 07:06 fairydreaming

The first PR is now merged; I created another one adding the Unigram tokenizer: #8089

fairydreaming avatar Jun 24 '24 07:06 fairydreaming

The second PR is merged and the third one is ready: #8141. After this it will be possible to use T5 models with llama-cli. ~~There are still some problems with the CUDA backend, though.~~

Edit: the problems were caused by the insufficient range of the f16 type, which resulted in -inf and later in nan values, so avoid f16 with the CUDA backend. With f32 there are no problems.
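
For reference, half precision can only represent finite values up to 65504, which is how large intermediate values end up as inf and then nan. A tiny illustration (the values here are made up for demonstration):

    #include <cmath>
    #include <cstdio>

    int main() {
        const float F16_MAX = 65504.0f;             // largest finite f16 value
        float x = 1.0e5f;                           // hypothetical intermediate activation
        float stored = x > F16_MAX ? INFINITY : x;  // effect of storing it as f16
        printf("%f %f\n", stored, stored - stored); // prints inf and nan
        return 0;
    }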

fairydreaming avatar Jun 26 '24 17:06 fairydreaming

This is really great work, and I'm excited to watch all the progress happening on this model architecture. I have one (hopefully) small request: support for the ByT5 tokenizer used by that family of T5 models. It has some unique use cases, such as multilingual tasks and text denoising, where it is really useful.

Fortunately it is a very simple tokenizer: mostly just UTF-8 bytes mapped directly to integers, plus some special tokens. https://github.com/huggingface/transformers/blob/main/src/transformers/models/byt5/tokenization_byt5.py
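
As a rough sketch of that mapping (assuming the offset-of-3 layout used by the HF ByT5Tokenizer and ignoring the extra sentinel ids):

    #include <cstdint>
    #include <string>
    #include <vector>

    // each UTF-8 byte maps to one token id after an offset for the
    // special tokens <pad>=0, </s>=1, <unk>=2
    static std::vector<int32_t> byt5_tokenize(const std::string & text) {
        const int32_t offset = 3;
        std::vector<int32_t> ids;
        ids.reserve(text.size());
        for (unsigned char b : text) {
            ids.push_back((int32_t) b + offset);
        }
        return ids;
    }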

lmiller-phdata avatar Jun 27 '24 13:06 lmiller-phdata

The third (and final) PR is now merged. TODO some day: add support for encoder-decoder models in llama-server.

fairydreaming avatar Jul 04 '24 18:07 fairydreaming

The third (and final) PR is now merged. TODO some day: add support for encoder-decoder models in llama-server.

@abetlen @fairydreaming

Hi,

Does this code land in llama.dll? I ask because I use llama-cpp-python, which uses llama.dll.

Great job btw, thanks

Sadeghi85 avatar Jul 05 '24 00:07 Sadeghi85

Does this code land in llama.dll? I ask because I use llama-cpp-python, which uses llama.dll.

@Sadeghi85 I suppose the new code is there, but to use encoder-decoder models like T5 you have to use the new API functions: llama_model_has_encoder(), llama_encode(), llama_model_decoder_start_token(). So I think you have to wait until the llama-cpp-python author (@abetlen) adds support for them.
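
For reference, a hedged sketch of how those calls fit together once a binding exposes them; this is simplified, and the batch construction and error handling may differ from the actual llama-cli code:

    std::vector<llama_token> tokens; // assume this holds the tokenized prompt

    if (llama_model_has_encoder(model)) {
        // run the encoder over the whole prompt
        llama_encode(ctx, llama_batch_get_one(tokens.data(), (int32_t) tokens.size(), 0, 0));

        // seed the decoder with the model's decoder start token
        llama_token dec_start = llama_model_decoder_start_token(model);
        if (dec_start == -1) {
            dec_start = llama_token_bos(model); // fall back for models without one
        }
        tokens.clear();
        tokens.push_back(dec_start);
    }

    // then decode autoregressively, as with decoder-only models
    llama_decode(ctx, llama_batch_get_one(tokens.data(), (int32_t) tokens.size(), 0, 0));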

fairydreaming avatar Jul 05 '24 06:07 fairydreaming