node-llama-cpp

feat: automatic batching

Open · giladgd opened this issue on Nov 05, 2023 · 8 comments

Also, automatically set the right contextSize and provide other good defaults to make the usage smoother.

  • Support configuring the context swapping size for infinite text generation (by default, it'll be automatic and dynamic depending on the prompt)
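
A rough sketch of what automatic batching could look like from the API side: one context holding multiple sequences, with concurrent prompts evaluated together in a single batch. The sequences option and both prompts below are assumptions for illustration, not the final API:

import {fileURLToPath} from "url";
import path from "path";
import {LlamaModel, LlamaContext, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const model = new LlamaModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});

// Assumption: a context that holds 2 sequences, so prompts that run at the
// same time can be evaluated together in a single batch
const context = new LlamaContext({model, sequences: 2});

const sessionA = new LlamaChatSession({contextSequence: context.getSequence()});
const sessionB = new LlamaChatSession({contextSequence: context.getSequence()});

// Prompting both sessions concurrently lets the library batch the evaluations
const [a1, b1] = await Promise.all([
    sessionA.prompt("Hi there, how are you?"),
    sessionB.prompt("Summarize the rules of chess in one paragraph")
]);
console.log("A: " + a1);
console.log("B: " + b1);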

giladgd · Nov 05, 2023

🎉 This issue has been resolved in version 3.0.0-beta.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

github-actions[bot] · Nov 26, 2023

Just tried it and:

llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32002
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name   = ehartford_dolphin-2.1-mistral-7b
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|im_end|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: mem required  = 4165.48 MiB
...............................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 4096.00 MiB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 136707.31 MiB
GGML_ASSERT: D:\a\node-llama-cpp\node-llama-cpp\llama\llama.cpp\llama.cpp:745: data

Tried to allocate 135 GB? :) Here's the test I used:

import path from "path";
import {
    LlamaModel,
    LlamaContext,
    LlamaChatSession,
    ChatMLChatPromptWrapper
} from "node-llama-cpp";

const model = new LlamaModel({
    modelPath: path.join("model", "dolphin-2.1-mistral-7b.Q4_K_M.gguf"),
});

const defaultSystemPrompt = 'You are an exceptional professional senior coder specialized in JavaScript and Python talking with a human. You have exceptional attention to detail, and when writing code you write well-commented code while also describing every step.';
const context = new LlamaContext({ model });
const session = new LlamaChatSession({ context, promptWrapper: new ChatMLChatPromptWrapper(), systemPrompt: defaultSystemPrompt, printLLamaSystemInfo: true });
const interact = async function(prompt, id, cb) {
    id = id || 'test';
    try {
        await session.prompt(prompt, {
            onToken(chunk) {
                cb(context.decode(chunk));
            }
        })
    } catch (e) {
      console.error(e)
    }
    return;
}

let prompt = `
Hi, please write a multi-dimensional sort algorithm and explain all the code
`
const cb = function(text) {
  process.stdout.write(text);
}
const test = async function() {
  await interact(prompt, null, cb)
}

await test()

carlosgalveias · Nov 28, 2023

@carlosgalveias I don't think you've installed the beta version, since I've just tried the model you mentioned here and it worked perfectly for me.

Make sure you install it this way:

npm install node-llama-cpp@beta

To use the 3.0.0-beta.1 version, do something like this:

import {fileURLToPath} from "url";
import path from "path";
import {LlamaModel, LlamaContext, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const model = new LlamaModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = new LlamaContext({model});
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);


const q2 = "Summarize what you said";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);

giladgd · Nov 29, 2023

@carlosgalveias I think I know what issue you have encountered. The 3.0.0-beta.1 version sets the context size by default to the context size the model was trained on, and if this number is too big, then it will allocate a lot of memory for the big context. I plan to make this library provide better defaults in one of the future betas, so for now, try to manually limit the context size of the model.

So for the 3.0.0-beta.1 version, do something like this:

import {fileURLToPath} from "url";
import path from "path";
import {LlamaModel, LlamaContext, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const model = new LlamaModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = new LlamaContext({
    model,
    contextSize: Math.min(4096, model.trainContextSize)
});
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);


const q2 = "Summarize what you said";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);

giladgd · Dec 03, 2023

@giladgd gotcha, thanks!

carlosgalveias · Dec 05, 2023

Will this change help with trying to do a lot of one-shot prompts in a loop? I'm seeing the error "could not find a KV slot for the batch (try reducing the size of the batch or increase the context)" regardless of what I set the batch size or context size to. I am creating the model and context outside of the loop and creating a new session inside of the loop on every iteration. Since these are one-shot prompts, I don't really care about llama2 having any session information.
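
For reference, a minimal sketch of the loop pattern described above (model and context created once, a new session on every iteration), assuming the 3.0.0-beta.1 API shown earlier in this thread; the model path and prompts are placeholders:

import {fileURLToPath} from "url";
import path from "path";
import {LlamaModel, LlamaContext, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

// Model and context are created once, outside the loop
const model = new LlamaModel({
    modelPath: path.join(__dirname, "models", "dolphin-2.1-mistral-7b.Q4_K_M.gguf")
});
const context = new LlamaContext({
    model,
    contextSize: Math.min(4096, model.trainContextSize)
});

const prompts = ["First one-shot prompt", "Second one-shot prompt"];
for (const prompt of prompts) {
    // A new session on every iteration; no chat history is carried over
    const session = new LlamaChatSession({
        contextSequence: context.getSequence()
    });
    console.log(await session.prompt(prompt));
}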

Zambonilli · Dec 27, 2023

@Zambonilli Can you please open a bug issue for this? It'll help me to investigate and fix the issue. Please also include code I can run to reproduce the issue and a link to the specific model file you used.

giladgd · Jan 11, 2024

@giladgd no problem, here is the ticket.

Zambonilli · Jan 12, 2024

🎉 This PR is included in version 3.0.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

github-actions[bot] · Sep 24, 2024