
Refactor: Allow adding both tokens and embeddings to `llama_batch`

Open · ngxson opened this issue on Nov 18 '24 · 3 comments

Background Description

Ref: https://github.com/ggerganov/llama.cpp/pull/7553 , required for supporting future vision models (https://github.com/ggerganov/llama.cpp/issues/8010)

I initially planned to open a proposal PR for this, but it turns out to be more complicated than I thought. Therefore, I'm creating this issue for further discussion before actually implementing it.

Possible Refactor Approaches

The problem can be divided into 2 parts:

  1. How should the llama_batch be constructed?
  2. How should the cgraph be modified?

For the second part (how the cgraph should be modified), the change should be simple: llm_build_inp_embd can be modified to concat the tensor of "learned" embeddings with the tensor of input embeddings. The pseudo-code looks like this:

// token IDs for the "token" part of the batch
lctx.inp_tokens = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, batch.n_tokens);
// look up the learned embeddings for those tokens
struct ggml_tensor * learned_embd = ggml_get_rows(ctx, tok_embd, lctx.inp_tokens);
// raw embeddings supplied directly by the user
struct ggml_tensor * inp_embd     = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, batch.n_tokens);
// concatenate along the token dimension (dim 1)
inpL = ggml_concat(ctx, learned_embd, inp_embd, 1);

The attention mask also needs to be updated accordingly.


For the first part (how the llama_batch can be constructed), the problem is that there are many possible approaches:

Proposal 1: Add n_embds to llama_batch

    typedef struct llama_batch {
        int32_t n_tokens;
        int32_t n_embds;

        llama_token  *  token;    // has n_tokens elements
        float        *  embd;     // has n_embds * dim_embd elements
        int32_t      *  n_seq_id; // has n_tokens + n_embds elements
        ...
    } llama_batch;

The downside of this approach is that it's quite messy to keep track of n_seq_id, seq_id and logits, since each of these arrays now has to cover both the token part and the embedding part of the batch.
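
To illustrate the bookkeeping problem, here is a rough sketch of how a mixed batch might be constructed under this proposal (the tokens-first-then-embeddings layout of the shared arrays is my assumption, not something defined anywhere):

    // hypothetical: a batch with 2 tokens followed by 1 raw embedding
    batch.n_tokens = 2;
    batch.n_embds  = 1;

    batch.token[0] = 1;
    batch.token[1] = 842;
    memcpy(batch.embd, my_embd_vector, n_embd * sizeof(float));

    // the shared arrays must be indexed with care: entries 0..n_tokens-1
    // describe the tokens, entries n_tokens..n_tokens+n_embds-1 the embeddings
    for (int i = 0; i < batch.n_tokens + batch.n_embds; i++) {
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
    }
    batch.logits[batch.n_tokens + batch.n_embds - 1] = 1;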

Proposal 2: Add an overloaded version of llama_decode/encode

llama_decode(struct llama_context * ctx, struct llama_batch * batch_tokens, struct llama_batch * batch_embd);

The downside would be that this is kind of "hacky" (not intuitive for developers), because one logical batch is now represented by two llama_batch objects.
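
For illustration, a usage sketch (the exact signatures here are hypothetical):

    // the text part and the embedding part of a multimodal prompt
    // travel as two separate llama_batch objects
    llama_batch batch_tokens = llama_batch_get_one(tokens, n_tokens);
    llama_batch batch_embd   = ...; // batch whose embd field holds the image embeddings

    // proposed overload: the two batches together form one logical batch
    llama_decode(ctx, &batch_tokens, &batch_embd);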

Proposal 3: Keep llama_batch the same, but treat token IDs < 0 as embeddings

For example:

batch.token = { 1, 842, -1, -1, 242, -1 };
batch.embd = {
  -0.4, 0.2, 0.12, ..., // corresponds to batch.token[2]
  0.04, 0.02, 0.3, ..., // corresponds to batch.token[3]
  0.04, 0.1, -0.3, ..., // corresponds to batch.token[5]
};

This seems to be easier to implement than the other proposals. The only thing I'm not sure about is whether we expect negative token IDs to be reserved for some other use case.
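
To make it concrete, a rough sketch of how the decode side could consume such a batch (inp_buf and tok_embd_data are hypothetical names for the input buffer and the embedding table data):

    int i_embd = 0; // next unconsumed row in batch.embd

    for (int i = 0; i < batch.n_tokens; i++) {
        float * dst = inp_buf + i * n_embd;
        if (batch.token[i] >= 0) {
            // regular token: take its row from the learned embedding table
            memcpy(dst, tok_embd_data + batch.token[i] * n_embd, n_embd * sizeof(float));
        } else {
            // negative ID: take the next user-provided embedding
            memcpy(dst, batch.embd + (size_t) i_embd * n_embd, n_embd * sizeof(float));
            i_embd++;
        }
    }

(In practice the token lookup would stay inside the cgraph via ggml_get_rows; the loop above is just to show the indexing.)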

Proposal 4: Completely refactor llama_batch to accept a list of sequences instead of a list of tokens

This was actually proposed by @slaren, but I'm not sure what it should look like in the real world. Could you please explain it further?
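
Just to give the discussion a starting point, here is a purely speculative sketch of one shape it could take (this is only my guess, not necessarily what @slaren has in mind):

    // purely hypothetical: one entry per sequence chunk instead of one per token
    typedef struct llama_seq_input {
        llama_seq_id    seq_id;
        llama_token  *  tokens; // token IDs, or NULL if this chunk carries embeddings
        float        *  embd;   // raw embeddings, or NULL if this chunk carries tokens
        int32_t         n;      // number of tokens / embedding rows
        llama_pos       pos_0;  // starting position within the sequence
    } llama_seq_input;

    // llama_decode(ctx, seq_inputs, n_seq_inputs);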

I'm also tagging @ggerganov and @abetlen for further discussion. Thank you!

ngxson · Nov 18 '24 11:11