llama : move the sampling API from common into llama lib
There is functionality around llama_sampling_context currently part of common. We should move it into llama. Pretty much the entire API from common/sampling.h except llama_sampling_params and llama_sampling_sample can be integrated into the library.
This would probably also require merging the grammar parser into the llama lib implementation.
The llama_sampling_params and llama_sampling_sample will stay in common since they are very example-specific and not general-purpose enough to be merged.
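For reference, the kind of API surface that would move looks roughly like this (illustrative signatures, not the exact ones from common/sampling.h):

```c
// Illustrative only - not the exact signatures from common/sampling.h.
// These are the kinds of calls that would move into llama.h, next to the
// existing llama_sample_* API (llama_sampling_params and llama_sampling_sample
// stay in common, as noted above).

struct llama_sampling_context;

// lifetime of the sampling state (previously sampled tokens, grammar state, ...)
struct llama_sampling_context * llama_sampling_init (/* library-side config TBD */);
void                            llama_sampling_free (struct llama_sampling_context * ctx_s);
void                            llama_sampling_reset(struct llama_sampling_context * ctx_s);

// record an accepted token so repetition penalties and the grammar state stay in sync
void llama_sampling_accept(
        struct llama_sampling_context * ctx_s,
        struct llama_context          * ctx,
        llama_token                     id,
        bool                            apply_grammar);
```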
Is this meant as a short-term stopgap measure? If we are going to add a new sampling API to llama.cpp, it would be good to do this from the ground up with the possibility of GPU sampling in mind. The implementation in sampling.h does not seem flexible enough to do this.
This change is more relevant for CPU-based sampling. There are many use cases that require managing a sampling state (e.g. previously sampled tokens, grammar state, etc.), so it makes sense to add support for this directly in the core library.
I haven't thought deeply about GPU sampling support. Wouldn't it make more sense to have a limited number of GPU sampling options (such as greedy and top-k) as part of llama_context, since this would require changing the compute graph in the first place? I don't expect we can ever support GPU grammar sampling, for example, or even GPU repeat-penalty - is that a correct assumption?
It's clear that some samplers cannot have GPU implementations, however this doesn't mean that we need two different APIs for GPU and CPU sampling. We could define a sampler as an abstract object that may or may not hold state, and that may provide a ggml or a CPU implementation. Then we would assemble a pipeline of sampler objects that runs at the end of the model evaluation. If all the samplers have ggml implementations, the whole pipeline can run on the GPU; otherwise at least some parts would still run on the CPU. I think it is mostly a matter of designing a flexible enough interface.
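As a very rough sketch (hypothetical types, nothing like this exists in the codebase yet), such an interface could look like:

```c
// Hypothetical sketch - none of these types exist in the codebase.
// A sampler may provide a CPU implementation, a ggml implementation, or both;
// a pipeline can run fully on the GPU only if every stage has a ggml implementation.

struct llama_sampler_i {
    // optional per-sampler state (grammar stacks, RNG, token history, ...)
    void * state;

    // CPU path: modify the candidate logits/probabilities in place (may be NULL)
    void (*apply_cpu)(void * state, llama_token_data_array * candidates);

    // GPU path: append the equivalent ops to the compute graph (may be NULL)
    struct ggml_tensor * (*apply_ggml)(void * state, struct ggml_context * ctx, struct ggml_tensor * logits);
};

// a pipeline is just an ordered list of samplers applied to the logits
struct llama_sampler_pipeline {
    struct llama_sampler_i * samplers;
    size_t                   n_samplers;
};
```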
Ok will give it further thought.
One way that comes to mind is something like this:
```c
int32_t llama_decode_with_sampling(
        struct llama_context          * ctx,
        struct llama_sampling_context * ctx_s,
        struct llama_batch              batch,
        llama_token                   * result);
```
The llama_sampling_context can hold the information about the sampling pipeline together with the sampling state. So in that sense, merging llama_sampling_context into llama seems compatible with future GPU support - it just has to be extended with the samplers info.
It's also important to allow multiple evaluations to be queued together - that's one of the biggest advantages of GPU sampling. That can be done by making llama_decode_with_sampling asynchronous, but then the token output result needs to be removed, since obtaining it requires flushing the queue.
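For illustration, the two call shapes could look roughly like this (everything here is hypothetical - llama_decode_with_sampling does not exist yet):

```c
// Hypothetical usage sketch - none of these calls exist in this form.

// synchronous form: returning the token directly forces a queue flush per call
llama_token id;
llama_decode_with_sampling(ctx, ctx_s, batch, &id);

// asynchronous form: no result pointer, so several evaluations can be queued
// back-to-back; the sampled tokens are read from ctx_s after a sync point
llama_decode_with_sampling_async(ctx, ctx_s, batch); // hypothetical name
// ... queue more work ...
llama_sampling_flush(ctx, ctx_s);                    // hypothetical sync point
```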
Yes, this might get tricky when considering multiple sequences in the batch, but seems doable.
Let me know if you have other concerns about merging llama_sampling_context - seems like a step in the right direction even when considering GPU support.
If we do that, then the llama_sample_... API that currently exists in llama.h can be updated to take a llama_sampling_context instead of candidates and last_tokens. This API will remain as a utility for users doing manual sampling, and can potentially be removed once llama_decode_with_sampling is fully implemented.
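For example, something like the current llama_sample_repetition_penalties could change roughly as follows (the updated form is only a sketch):

```c
// current form (roughly): the caller passes the candidates and the token history explicitly
void llama_sample_repetition_penalties(
               struct llama_context * ctx,
             llama_token_data_array * candidates,
                  const llama_token * last_tokens,
                             size_t   penalty_last_n,
                              float   penalty_repeat,
                              float   penalty_freq,
                              float   penalty_present);

// sketch of the updated form: the history and other state live in the sampling context
void llama_sample_repetition_penalties(
        struct llama_context          * ctx,
        struct llama_sampling_context * ctx_s,
        float                           penalty_repeat,
        float                           penalty_freq,
        float                           penalty_present);
```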
This issue is stale because it has been open for 30 days with no activity.
Not stale.
This issue was closed because it has been inactive for 14 days since being marked as stale.
So that I better understand this approach: would this involve updating llama_build_graph to take the sampling context and append it to the constructed model graph? Or alternatively, some sort of "postprocessing graph" type that llama_decode_with_sampling could convert the sampling context into?
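For example (purely illustrative, assuming the logits tensor is available when the graph is built), appending a greedy sampler could look like:

```c
// Purely illustrative - not an existing llama.cpp function.
// "logits" is the [n_vocab, n_tokens] output tensor produced by the model graph.
static struct ggml_tensor * append_greedy_sampler(
        struct ggml_context * ctx0,
        struct ggml_cgraph  * gf,
        struct ggml_tensor  * logits) {
    // argmax over the vocabulary dimension -> one token id (I32) per batch position
    struct ggml_tensor * ids = ggml_argmax(ctx0, logits);

    ggml_build_forward_expand(gf, ids);

    return ids;
}
```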
Resolved via #9294