llama.cpp
[Documentation] C API examples
Hey!
There should be a simple example of how to use the new C API (like one that simply takes a hardcoded string and runs llama on it until \n or something like that).
Not sure whether the /examples/ directory is appropriate for this.
Thanks, Niansa
Agreed. I'm planning to write some wrappers that expose llama.cpp to other languages via the new llama.h, and documentation would be helpful. I am happy to look into writing an example for it if @ggerganov or anyone else isn't planning to do so.
@SpeedyCraftah go for it, here is a rough overview:
const std::string prompt = " This is the story of a man named ";

llama_context* ctx;
auto lparams = llama_context_default_params();

// load model
ctx = llama_init_from_file("../../llama.cpp/models/7B/ggml-model-q4_0.bin", lparams);

// determine the required inference memory per token:
// TODO: this is a hack copied from main.cpp, idk what's up here
{
    const std::vector<llama_token> tmp = { 0, 1, 2, 3 };
    llama_eval(ctx, tmp.data(), tmp.size(), 0, N_THREADS);
}

// convert prompt to embeddings (tokens)
std::vector<llama_token> embd_inp(prompt.size() + 1);
auto n_of_tok = llama_tokenize(ctx, prompt.c_str(), embd_inp.data(), embd_inp.size(), true);
embd_inp.resize(n_of_tok);

// evaluate the prompt, one token at a time (batch size 1)
for (size_t i = 0; i < embd_inp.size(); i++) {
    llama_eval(ctx, embd_inp.data() + i, 1, i, N_THREADS);
}

std::string prediction;
std::vector<llama_token> embd = embd_inp;

// n_predict = number of tokens to predict
for (int i = 0; i < n_predict; i++) {
    const llama_token id = llama_sample_top_p_top_k(ctx, nullptr, 0, 40, 0.8f, 0.2f, 1.f/0.85f);

    // TODO: break here if EOS

    // add it to the context (all tokens, prompt + predict)
    embd.push_back(id);

    // add to string
    prediction += llama_token_to_str(ctx, id);

    // eval next token
    llama_eval(ctx, &embd.back(), 1, embd.size(), N_THREADS);
}

llama_free(ctx); // cleanup
edit: removed the -1 from last eval
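For anyone copying the snippet as-is: it leaves a few things undefined. Here is a rough sketch of the scaffolding it assumes, plus one possible way to handle the EOS TODO (this assumes llama.h exposes llama_token_eos(); N_THREADS and n_predict are placeholder values, not part of the original snippet):

// Scaffolding assumed by the snippet above (values here are placeholders).
#include "llama.h"

#include <string>
#include <vector>

static const int N_THREADS = 4;   // CPU threads passed to llama_eval
static const int n_predict = 128; // how many tokens to generate

// Inside the generation loop, the "break here if EOS" TODO could look like:
//     if (id == llama_token_eos()) {
//         break;
//     }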
The ./examples folder should contain all programs generated by the project. For example, main.cpp has to become an example in ./examples/main. The utils.h and utils.cpp have to be moved to the ./examples folder and be shared across all examples. See the whisper.cpp examples structure for reference.
Absolutely wonderful! This example alone was enough to make me understand how to use the API 👍
Can verify this works, note though that you've mixed up tok and id.
:) yeah, I was just throwing stuff together from my own experiments and main.cpp
Can confirm it works, thank you. I was wondering why it generated tokens so slowly, but enabling compiler release optimisations today fixed that. It is a CPU machine learning framework, after all.
I will try to cook something simple and helpful and submit it.
Thank you for the instructions. It will be super helpful to have a minimal example of how to fire up the API and import it from Python as a package, so one can send requests (together with the generation parameters) to the API.
Definitely, instead of using the janky command-line method and then extracting the outputs. I am planning to write a node-gyp binding for it so that you can run it directly via node.js.
@Green-Sky If you don't mind me asking, how do I go about increasing the batch size of the prompt? I tried something naive but it just seems to be resulting in undefined behaviour (I tried to set a batch of 8):
for (size_t i = 0; i < embd_inp.size(); i++) {
    llama_eval(ctx, embd_inp.data() + (i * 8), 8, i * 8, N_THREADS);
}
Did I do something wrong, or rather, what did I not do?
EDIT - Just realised I didn't then divide the loop length by 8 (yes, I will handle remainders don't worry). But it seems to be working now!
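For reference, a sketch of what the corrected loop could look like once the loop steps by the batch size and the remainder is handled in a final smaller call (same llama_eval signature as above; the batch size of 8 is just the value from the question):

// Evaluate the prompt in batches (needs <algorithm> for std::min).
const size_t n_batch = 8;
for (size_t i = 0; i < embd_inp.size(); i += n_batch) {
    // the last batch may be smaller than n_batch
    const int n_eval = (int) std::min(n_batch, embd_inp.size() - i);
    llama_eval(ctx, embd_inp.data() + i, n_eval, (int) i, N_THREADS);
}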
@SpeedyCraftah any update on this?
Going well! I'm finished with the final mock-up; it now just needs some polishing, size_t conversion warning fixes, and comments, and then it's ready to go. It should probably be split up into multiple parts, such as an "example of barebones generation" and an "example of generation with a stop sequence", so it isn't so complex right off the bat. I also added stop sequences similar to how OpenAI does them: tokens that appear to match the start of the stop sequence are held back from being printed/saved, and once it's confirmed they are not actually a stop sequence, all the held-back tokens are replayed.
The only issue is that the time from loading the model to generating the first token is noticeably longer than when running the same parameters and prompt with the main.exe CLI. I'm also not sure if I implemented batching correctly; I kind of took a guess at how it might be done, so I should probably look at the main CLI for that.
Would be great if you could look over it first! https://paste.gg/p/anonymous/4440251201fd45d49d051a4d8661fee5
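This is not the code from the paste, but a rough sketch of the stop-sequence buffering described above, just to make the idea concrete: token text that could still turn into the stop sequence is held back, replayed once the match is ruled out, and dropped on a full match (stop_filter and its members are made up for this sketch):

// Sketch of OpenAI-style stop sequences (illustrative only, not the submitted code).
#include <iostream>
#include <string>

struct stop_filter {
    std::string stop_seq; // the stop sequence to look for
    std::string held;     // text that might still turn into the stop sequence

    // feed one decoded token; returns false once the stop sequence is complete
    bool feed(const std::string & piece) {
        held += piece;
        if (held.find(stop_seq) != std::string::npos) {
            return false; // full match: stop generating, discard the held text
        }
        if (stop_seq.compare(0, held.size(), held) == 0) {
            return true;  // still a possible prefix: keep holding it back
        }
        // match ruled out: replay everything that was held back
        // (a fuller version would re-check the tail of `held` for a new partial match)
        std::cout << held;
        held.clear();
        return true;
    }
};

int main() {
    stop_filter f{"\n\n", ""};
    for (const std::string piece : {"Once", " upon", " a", " time", "\n\n", "never printed"}) {
        if (!f.feed(piece)) break;
    }
    std::cout << std::endl;
}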