llama.cpp
[Documentation] C API examples
Hey!
There should be a simple example of how to use the new C API (like one that simply takes a hardcoded string and runs llama on it until \n or something like that).
Not sure whether the /examples/ directory is appropriate for this.
Thanks, Niansa
Agreed. I'm planning to write some wrappers that expose llama.cpp to other languages via the new llama.h, and documentation would be helpful. I am happy to look into writing an example for it if @ggerganov or anyone else isn't planning to do so.
@SpeedyCraftah go for it, here is a rough overview:
const std::string prompt = " This is the story of a man named ";

llama_context* ctx;
auto lparams = llama_context_default_params();

// load model
ctx = llama_init_from_file("../../llama.cpp/models/7B/ggml-model-q4_0.bin", lparams);

// determine the required inference memory per token:
// TODO: this is a hack copied from main.cpp, idk what's up here
{
    const std::vector<llama_token> tmp = { 0, 1, 2, 3 };
    llama_eval(ctx, tmp.data(), tmp.size(), 0, N_THREADS);
}

// convert prompt to embeddings (tokens)
std::vector<llama_token> embd_inp(prompt.size() + 1);
auto n_of_tok = llama_tokenize(ctx, prompt.c_str(), embd_inp.data(), embd_inp.size(), true);
embd_inp.resize(n_of_tok);

// evaluate the prompt, one token at a time (batch size 1)
for (size_t i = 0; i < embd_inp.size(); i++) {
    llama_eval(ctx, embd_inp.data() + i, 1, i, N_THREADS);
}

std::string prediction;
std::vector<llama_token> embd = embd_inp;

// n_predict = number of tokens to predict
for (int i = 0; i < n_predict; i++) {
    const llama_token id = llama_sample_top_p_top_k(ctx, nullptr, 0, 40, 0.8f, 0.2f, 1.f/0.85f);

    // TODO: break here if EOS

    // add it to the context (all tokens, prompt + predict)
    embd.push_back(id);

    // add to string
    prediction += llama_token_to_str(ctx, id);

    // eval next token
    llama_eval(ctx, &embd.back(), 1, embd.size(), N_THREADS);
}

llama_free(ctx); // cleanup
edit: removed the -1 from last eval
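For anyone copying the snippet as-is: it leaves a few things undefined. Here is a rough sketch of the scaffolding it assumes, plus one possible way to handle the EOS TODO (this assumes llama.h exposes llama_token_eos(); N_THREADS and n_predict are placeholder values, not part of the original snippet):

// Scaffolding assumed by the snippet above (values here are placeholders).
#include "llama.h"

#include <string>
#include <vector>

static const int N_THREADS = 4;   // CPU threads passed to llama_eval
static const int n_predict = 128; // how many tokens to generate

// Inside the generation loop, the "break here if EOS" TODO could look like:
//     if (id == llama_token_eos()) {
//         break;
//     }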
The ./examples folder should contain all programs generated by the project. For example, main.cpp has to become an example in ./examples/main. The utils.h and utils.cpp have to be moved to the ./examples folder and be shared across all examples. See the whisper.cpp examples structure for reference.
Absolutely wonderful! This example alone was enough to make me understand how to use the API 👍
Can verify this works, note though that you've mixed up tok and id.
:) yeah, I was just throwing stuff together from my own experiments and main.cpp
Can confirm it works, thank you. I was wondering why it generated tokens so slowly, but enabling compiler release optimisations today fixed that. It is a CPU machine learning framework, after all.
I will try to cook something simple and helpful and submit it.
Thank you for the instructions. It will be super helpful to have a minimal example of how to fire up the API and import it from Python as a package, so one can send requests (together with the generation parameters) to the API.
Definitely, instead of using the janky command-line method and then extracting the outputs. I am planning to write a node-gyp binding for it so that you can run it directly via node.js.
@Green-Sky If you don't mind me asking, how do I go about increasing the batch size of the prompt? I tried something naive but it just seems to be resulting in undefined behaviour (I tried to set a batch of 8):
for (size_t i = 0; i < embd_inp.size(); i++) {
    llama_eval(ctx, embd_inp.data() + (i * 8), 8, i * 8, N_THREADS);
}
Did I do something wrong, or rather, what did I not do?
EDIT - Just realised I didn't then divide the loop length by 8 (yes, I will handle remainders don't worry). But it seems to be working now!
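For reference, a sketch of what the corrected loop could look like once the loop steps by the batch size and the remainder is handled in a final smaller call (same llama_eval signature as above; the batch size of 8 is just the value from the question):

// Evaluate the prompt in batches (needs <algorithm> for std::min).
const size_t n_batch = 8;
for (size_t i = 0; i < embd_inp.size(); i += n_batch) {
    // the last batch may be smaller than n_batch
    const int n_eval = (int) std::min(n_batch, embd_inp.size() - i);
    llama_eval(ctx, embd_inp.data() + i, n_eval, (int) i, N_THREADS);
}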
@SpeedyCraftah any update on this?
Going well! I'm finished with the final mock-up; it now just needs some polishing, size_t conversion warning fixes, and comments, and then it's ready to go. It should probably be split up into multiple parts, such as an "example of barebones generation" and an "example of generation with a stop sequence", so it isn't so complex right off the bat. I also added stop sequences similar to how OpenAI does them: tokens that appear to match the start of the stop sequence are held back from being printed/saved, and once it's confirmed they are not actually a stop sequence, all the held-back tokens are replayed.
The only issue is that the time from loading the model to generating the first token is noticeably longer than when running the same parameters and prompt with the main.exe CLI. I'm also not sure if I implemented batching correctly; I kind of took a guess at how it might be done, so I should probably look at the main CLI for that.
Would be great if you could look over it first! https://paste.gg/p/anonymous/4440251201fd45d49d051a4d8661fee5
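This is not the code from the paste, but a rough sketch of the stop-sequence buffering described above, just to make the idea concrete: token text that could still turn into the stop sequence is held back, replayed once the match is ruled out, and dropped on a full match (stop_filter and its members are made up for this sketch):

// Sketch of OpenAI-style stop sequences (illustrative only, not the submitted code).
#include <iostream>
#include <string>

struct stop_filter {
    std::string stop_seq; // the stop sequence to look for
    std::string held;     // text that might still turn into the stop sequence

    // feed one decoded token; returns false once the stop sequence is complete
    bool feed(const std::string & piece) {
        held += piece;
        if (held.find(stop_seq) != std::string::npos) {
            return false; // full match: stop generating, discard the held text
        }
        if (stop_seq.compare(0, held.size(), held) == 0) {
            return true;  // still a possible prefix: keep holding it back
        }
        // match ruled out: replay everything that was held back
        // (a fuller version would re-check the tail of `held` for a new partial match)
        std::cout << held;
        held.clear();
        return true;
    }
};

int main() {
    stop_filter f{"\n\n", ""};
    for (const std::string piece : {"Once", " upon", " a", " time", "\n\n", "never printed"}) {
        if (!f.feed(piece)) break;
    }
    std::cout << std::endl;
}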