changelog : `libllama` API
Overview
This is a list of changes to the public interface of the llama library. Collaborators are encouraged to edit this post in order to reflect important changes to the API that end up merged into the master branch.
If you are building a 3rd party project that relies on libllama, it is recommended to follow this issue and check it before upgrading to new versions.
See also:
Recent API changes (most recent at the top)
| version | PR | desc |
|---|---|---|
| TBD | #14631 | Remove `enum llama_vocab_pre_type` |
| b5435 | #13653 | Remove `llama_kv_cache_view_*` API |
| b5740 | #13037 | Update `llama_model_quantize_params` |
| b5429 | #13194 | Update `llama_context_params` - add `bool swa_full` |
| b5311 | #13284 | Update `llama_context_params` - remove `logits_all` + rearrange flags |
| b5125 | #12511 | Update llama_model_quantize_params |
| b5028 | #11397 | Update llama_model_params |
| b4882 | #12181 | Change llama_kv_cache_... -> llama_kv_self_... |
| b4599 | #9639 | Add llama_sampler_init_grammar_lazy to support lazy grammars w/ trigger words & tokens |
| b4524 | #11016 | Add name parameter to llama_model_chat_template (uses default template if NULL) |
| b4501 | #11262 | Remove rpc_servers from llama_model and llama_model_params |
| b4464 | #11110 | Add llama_vocab and rename various structs and calls |
| b4424 | #11063 | Update llama_model API naming |
| b4357 | #10784 | Remove llama_model_get_tensor() |
| b4337 | #10803 | Change llama_sampler_init_penalties() |
| b4282 | #10446 | Remove support for Q4_0_N_M model files in favor of automatic repacking of Q4_0 |
| b4167 | #10497 | Add devices to llama_model_params |
| b3948 | #9897 | Deprecate `softmax` sampler and update `dist` sampler |
| b3988 | #10071 | Remove Tail-Free sampling |
| b3943 | #9745 | Remove all_pos_0, all_pos_1, all_seq_id from llama_batch |
| b3908 | #9798 | Update FIM-related API |
| b3841 | #9510 | Add LLAMA_POOLING_TYPE_RANK |
| b3774 | #9512 | Add llama_n_head() |
| b3750 | #9355 | Add llama_perf API + param to disable internal profiling |
| b3749 | #9445 | Add llama_sampler_chain_remove() |
| b3681 | #9294 | Major changes to the sampling API (see PR for more info) |
| b3651 | #8980 | Add LLAMA_VOCAB_TYPE_RWKV enum value |
| b3644 | #8672 | Add llama_threadpool API + change uint32_t -> int32_t |
| b3614 | #8526 | Add llama_model_is_recurrent |
For older changes, use:
git log --oneline -p b3614 -- include/llama.h
(For collaborators) To map a PR number to its build number:
git log --oneline | tail -r | nl
Upcoming API changes
- TBD
#9355 restores the functionality for getting performance measurements from within libllama (which was removed in #9294) via a new `llama_perf` API. `llama_context_params` is extended with a new `bool no_perf` parameter that can be used to disable the internal timings during libllama compute.
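For reference, a minimal sketch of how the new API can be used (not from the PR itself; it assumes an already-loaded `model` and uses the current constructor name `llama_init_from_model` - older builds use `llama_new_context_with_model`):

```c
#include <stdio.h>
#include "llama.h"

// Sketch: create a context with internal timings enabled and read them back.
// `model` is assumed to be loaded elsewhere, e.g. via llama_model_load_from_file().
static void report_perf(struct llama_model * model) {
    struct llama_context_params cparams = llama_context_default_params();
    cparams.no_perf = false;  // set to true to disable the internal timings

    struct llama_context * ctx = llama_init_from_model(model, cparams);

    // ... llama_decode() calls would go here ...

    // read the accumulated measurements, or print the standard report
    struct llama_perf_context_data perf = llama_perf_context(ctx);
    printf("prompt eval: %d tokens in %.2f ms\n", perf.n_p_eval, perf.t_p_eval_ms);
    llama_perf_context_print(ctx);

    llama_free(ctx);
}
```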
Looks like `llama_model_get_tensor` was removed from the API but that change was not documented here.
> Looks like `llama_model_get_tensor` was removed from the API but that change was not documented here.
I didn't expect that this function was being used by anyone, so I skipped updating the changelog. It's updated now.
Btw, what do you use this call for?
> I didn't expect that this function was being used by anyone, so I skipped updating the changelog. It's updated now.
> Btw, what do you use this call for?
I don't use it personally, but the function was included in my Python code. I started getting ctypes "symbol not found" errors and had to do some digging to figure out why. No worries!
Significant updates to the public API - added `struct llama_vocab` and applied multiple naming changes for better consistency: #11110. Updating user code should be relatively easy - mostly changing functions to use the new names and calling `llama_model_get_vocab(model)` where the old API required a `llama_model` and the new API requires a `llama_vocab`.
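A rough before/after sketch of the migration (assuming `model` is already loaded; the old calls are shown in the comments):

```c
#include "llama.h"

// Sketch of migrating to the llama_vocab API from #11110.
static void vocab_migration_example(struct llama_model * model) {
    const struct llama_vocab * vocab = llama_model_get_vocab(model);

    llama_token bos     = llama_vocab_bos(vocab);       // was: llama_token_bos(model)
    int32_t     n_vocab = llama_vocab_n_tokens(vocab);  // was: llama_n_vocab(model)

    // tokenization now takes the vocab instead of the model
    llama_token tokens[32];
    int32_t n_tokens = llama_tokenize(vocab, "hello", 5, tokens, 32,
                                      /*add_special*/ true, /*parse_special*/ false);

    (void) bos; (void) n_vocab; (void) n_tokens;
}
```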
Not actually related to the core library, but `LLAMA_CURL` is now turned on by default (link to the PR here). If you're building llama.cpp in your project just to get libllama, make sure to disable either `LLAMA_CURL` or `LLAMA_BUILD_COMMON`.
PR #12511 changed the `llama_model_quantize_params` API by introducing an additional `void * tensor_types` parameter.
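A small sketch of how to stay compatible with such additions - start from `llama_model_quantize_default_params()` so new fields keep their library defaults (filenames below are placeholders):

```c
#include "llama.h"

// Sketch: the defaults leave newly added fields (such as tensor_types) safely initialized.
static uint32_t quantize_example(void) {
    llama_model_quantize_params qparams = llama_model_quantize_default_params();
    qparams.ftype   = LLAMA_FTYPE_MOSTLY_Q4_K_M;
    qparams.nthread = 8;
    // qparams.tensor_types is left at its default unless per-tensor overrides are needed

    return llama_model_quantize("model-f16.gguf", "model-q4_k_m.gguf", &qparams);
}
```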
PR #11397 added `llama_model_tensor_buft_override` to `llama_model_params`, which can cause a segmentation fault when loading a model due to this line in src/llama-model.cpp:
pimpl->has_tensor_overrides = params.tensor_buft_overrides && params.tensor_buft_overrides[0].pattern;
Since I did not change the way I was loading the model after that PR, I got a segfault: my `llama_model_params` did not initialize the new `tensor_buft_overrides` field.
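For anyone hitting the same issue, a minimal sketch of the safe pattern - start from `llama_model_default_params()` so `tensor_buft_overrides` (and any future fields) stay zero-initialized (the model path is a placeholder):

```c
#include "llama.h"

// Sketch: override only the fields you need on top of the library defaults.
static struct llama_model * load_model_example(void) {
    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;  // example override

    return llama_model_load_from_file("model.gguf", mparams);
}
```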
@ddh0 We missed documenting this change. I added an entry to the table above. Thanks.
PR #16382 updated the libllama interface to add `LLAMA_API bool llama_model_is_hybrid(...)`.
PR #16310 added a new param `--no-host` to disable the host buffer, allowing extra buffers (Repack + AMX) to enable AMX acceleration on CPU layers when a GPU is present.