Refactor: Allow adding both tokens and embeddings to `llama_batch`
Background Description
Ref: https://github.com/ggerganov/llama.cpp/pull/7553, required for supporting future vision models (https://github.com/ggerganov/llama.cpp/issues/8010)
I initially planned to make a proposal PR for this, but it turns out to be quite a bit more complicated than I thought. Therefore, I'm creating this issue for further discussion before actually implementing it.
Possible Refactor Approaches
The problem can be divided into 2 parts:
- How can the `llama_batch` be constructed?
- How should the cgraph be modified?
For the second part (how the cgraph should be modified), it should be simple: `llm_build_inp_embd` can be modified to concat tensors from the "learned" embd and the input embd. The pseudo-code looks like this:
```c
// Look up the "learned" embeddings for the token IDs in the batch ...
lctx.inp_tokens = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, batch.n_tokens);
struct ggml_tensor * learned_embd = ggml_get_rows(ctx, tok_embd, lctx.inp_tokens);

// ... and an input tensor that will be filled with the raw embeddings from the batch.
struct ggml_tensor * inp_embd = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, batch.n_tokens);

// Concatenate both into the first layer input.
inpL = ggml_concat(ctx, learned_embd, inp_embd);
```
The attention mask also needs to be updated accordingly.
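For completeness, the host-side input setup (e.g. in `llama_set_inputs`) would also need a matching change. A minimal sketch, assuming `inp_embd` is stored on the context next to `inp_tokens` and that the batch exposes the number of raw embedding rows as `n_embds` (how exactly it does so is the topic of the proposals below):

```c
// Sketch only: upload both inputs before evaluating the graph.
// The tensors are the ones created in the pseudo-code above.
if (batch.token) {
    ggml_backend_tensor_set(lctx.inp_tokens, batch.token, 0,
            batch.n_tokens * ggml_element_size(lctx.inp_tokens));
}
if (batch.embd) {
    // assumes lctx.inp_embd was sized to hold exactly n_embds rows of n_embd floats
    ggml_backend_tensor_set(lctx.inp_embd, batch.embd, 0,
            n_embds * n_embd * ggml_element_size(lctx.inp_embd));
}
```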
For the first part (how the `llama_batch` can be constructed), the problem is that there are many different possible approaches:
Proposal 1: Add `n_embds` to `llama_batch`
```c
typedef struct llama_batch {
    int32_t       n_tokens;
    int32_t       n_embds;

    llama_token * token;
    float       * embd;     // has n_embds * dim_embd elements
    int32_t     * n_seq_id; // has n_tokens + n_embds elements
    ...
}
```
The downside of this approach is that it's quite messy to keep track of `n_seq_id`, `seq_id`, and `logits`.
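To make that bookkeeping concern concrete, here is a hypothetical sketch of consuming such a batch, assuming a tokens-first layout for the shared per-entry arrays (the layout itself is my assumption, not part of the proposal):

```c
// Hypothetical layout: entries 0..n_tokens-1 are tokens,
// entries n_tokens..n_tokens+n_embds-1 are raw embeddings.
// All shared arrays (pos, n_seq_id, seq_id, logits) must then be
// sized and indexed with the combined count, which is easy to get wrong.
const int32_t n_total = batch.n_tokens + batch.n_embds;

for (int32_t i = 0; i < n_total; i++) {
    if (i >= batch.n_tokens) {
        // entry i consumes the next row of batch.embd
    } else {
        // entry i consumes the next element of batch.token
    }

    for (int32_t j = 0; j < batch.n_seq_id[i]; j++) {
        const llama_seq_id sid = batch.seq_id[i][j];
        // ... assign entry i (token or embedding) to sequence sid ...
    }

    if (batch.logits[i]) {
        // ... request logits for entry i ...
    }
}
```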
Proposal 2: Add an overloaded version of `llama_decode`/`llama_encode`
```c
llama_decode(struct llama_context * ctx, struct llama_batch * batch_tokens, struct llama_batch * batch_embd);
```
The downside would be that this is kind of "hacky" (not intuitive for developers), because one batch is now represented by two `llama_batch` objects.
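As a usage sketch under this proposal (the overload itself is hypothetical; `llama_batch_init` and `llama_batch_free` are the existing helpers, and the counts and fill code are placeholders):

```c
// One logical batch is split into a token-only llama_batch
// and an embedding-only llama_batch.
struct llama_batch batch_tokens = llama_batch_init(n_text_tokens, /*embd =*/ 0,      /*n_seq_max =*/ 1);
struct llama_batch batch_embd   = llama_batch_init(n_image_embd,  /*embd =*/ n_embd, /*n_seq_max =*/ 1);

// ... fill token IDs, raw embedding rows, positions and seq_ids for both ...

// Hypothetical overload: decode both halves as a single logical batch.
if (llama_decode(ctx, &batch_tokens, &batch_embd) != 0) {
    // handle failure
}

llama_batch_free(batch_tokens);
llama_batch_free(batch_embd);
```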
Proposal 3: Keep `llama_batch` the same, but token IDs < 0 are embeddings
For example:
```c
batch.token = { 1, 842, -1, -1, 242, -1 };
batch.embd  = {
    -0.4,  0.2,   0.12, ..., // corresponds to batch.token[2]
     0.04, 0.02,  0.3,  ..., // corresponds to batch.token[3]
     0.04, 0.1,  -0.3,  ..., // corresponds to batch.token[5]
};
```
This seems to be easier to implement than all the other proposals. The only thing I'm not sure about is whether we expect negative token IDs to be a reserved use case.
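For reference, a rough sketch of how the decode path could consume such a batch (illustrative only; the reserved-ID question above still applies):

```c
// Walk the batch once and route each entry to either the learned
// token embedding or the next raw row in batch.embd.
int32_t i_embd = 0; // next unused row in batch.embd

for (int32_t i = 0; i < batch.n_tokens; i++) {
    if (batch.token[i] >= 0) {
        // regular token: use ggml_get_rows(tok_embd, token[i]) for position i
    } else {
        // embedding entry: use batch.embd + i_embd * n_embd for position i
        i_embd++;
    }
}
```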
Proposal 4: Completely refactor `llama_batch` to accept a sequence list instead of a token list
This was actually proposed by @slaren, but I'm not sure what it would look like in the real world. Could you please explain it further?
I'm also tagging @ggerganov and @abetlen for further discussion. Thank you!