
sampling : add support for GPU sampling (wip)

Open danbev opened this issue 2 months ago • 3 comments

This is a work in progress to add support for GPU sampling.

The motivation for this feature is to enable some or all of the sampling to be performed directly on the GPU, as part of the computation graph being executed.

For example, the GPU sampler chain might select/sample a token directly, in which case only the sampled token needs to be transferred from device memory to host memory.

It is also possible for the GPU samplers to perform filtering of the logits, or to compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers.

Currently, GPU sampling works in a similar manner to pooling: it is a function that is called by build_graph, and the sampler operations become part of the model's computation graph.

GPU samplers can be configured by creating sampler chains, where each sampler chain is associated with a specific sequence id:

    struct llama_sampler_chain_params params = llama_sampler_chain_default_params();
    struct llama_sampler * chain = llama_sampler_chain_init(params);
    llama_sampler_chain_add(chain, llama_sampler_gpu_init_greedy());
    std::vector<llama_sampler_seq_config> sampler_configs = {
        { 0, chain }
    };

The struct is defined as:

    struct llama_sampler_seq_config {
        llama_seq_id           seq_id;
        struct llama_sampler * sampler;
    };

These sampler configs are then passed as context params:

    llama_context_params cparams = llama_context_default_params();
    cparams.samplers = sampler_configs.data();
    cparams.n_samplers = sampler_configs.size();
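
For a minimal end-to-end sketch, each sequence can be given its own chain before the context is created. This is only an illustration based on the API shown above: the two-sequence setup is hypothetical, and llama_init_from_model is the regular context-creation call, whose interaction with the new params is assumed from the description in this PR.

    // sketch: one GPU sampler chain per sequence (here two sequences, both greedy)
    struct llama_sampler * chain_seq0 = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(chain_seq0, llama_sampler_gpu_init_greedy());

    struct llama_sampler * chain_seq1 = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(chain_seq1, llama_sampler_gpu_init_greedy());

    std::vector<llama_sampler_seq_config> sampler_configs = {
        { /*seq_id*/ 0, chain_seq0 },
        { /*seq_id*/ 1, chain_seq1 },
    };

    llama_context_params cparams = llama_context_default_params();
    cparams.samplers   = sampler_configs.data();
    cparams.n_samplers = sampler_configs.size();

    // create the context with the per-sequence GPU sampler chains attached
    struct llama_context * ctx = llama_init_from_model(model, cparams);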

When the model graph is built, the GPU samplers are called so that they can add their operations to the graph:

ggml_cgraph * llama_model::build_graph(const llm_graph_params & params) const {
    std::unique_ptr<llm_graph_context> llm;
    ...

    // add GPU sampling layers (if any)
    llm->build_sampling(*this, params);

The llama_sampler_i interface has been extended with four new methods in the API, currently all named with a _ggml suffix to indicate that they are for GPU sampling (and possibly other devices like NPUs in the future):

        void (*init_ggml)     (struct llama_sampler * smpl,
                               ggml_backend_buffer_type_t buft);

        void (*set_input_ggml)(struct llama_sampler * smpl,
                               ggml_context * ctx,
                               ggml_cgraph * gf);

        void (*apply_ggml)    (struct llama_sampler * smpl,
                               ggml_context * ctx,
                               ggml_cgraph * gf,
                               llama_sampler_ggml_data * ggml_data);

        void (*accept_ggml)   (struct llama_sampler * smpl,
                               ggml_context * ctx,
                               ggml_cgraph * gf,
                               struct ggml_tensor * selected_token);

The init_ggml function allows GPU samplers to create any input tensors they might need. The passed ggml_backend_buffer_type_t should be used when creating these tensors so that they are allocated on the same backend as the output logits. This avoids splits in the computation graph that would require data transfers between different backends.
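
As a rough sketch of what an init_ggml implementation could look like (the my_sampler_ctx state struct, the fixed vocabulary size and the bias tensor are illustrative assumptions, not code from this PR):

    // sketch: a hypothetical sampler state that pre-allocates an input tensor
    // on the same backend buffer type as the output logits
    struct my_sampler_ctx {
        struct ggml_context  * ctx_state = nullptr; // holds the tensor metadata
        ggml_backend_buffer_t  buf       = nullptr; // device buffer backing the tensors
        struct ggml_tensor   * inp_bias  = nullptr; // example input tensor (e.g. a logit bias)
    };

    static void my_sampler_init_ggml(struct llama_sampler * smpl, ggml_backend_buffer_type_t buft) {
        auto * sctx = (my_sampler_ctx *) smpl->ctx;

        struct ggml_init_params ip = {
            /*.mem_size   =*/ ggml_tensor_overhead() * 8,
            /*.mem_buffer =*/ nullptr,
            /*.no_alloc   =*/ true, // tensor data lives in the backend buffer allocated below
        };
        sctx->ctx_state = ggml_init(ip);
        sctx->inp_bias  = ggml_new_tensor_1d(sctx->ctx_state, GGML_TYPE_F32, 32000 /* n_vocab, placeholder */);

        // allocate all tensors of this context using the requested buffer type
        sctx->buf = ggml_backend_alloc_ctx_tensors_from_buft(sctx->ctx_state, buft);
    }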

The set_input_ggml function is called after the computation graph has been scheduled but before it is computed. This allows the GPU sampler to set any input for the tensors it created in init_ggml.
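
Continuing the same hypothetical sampler state, set_input_ggml could upload host-side data into the tensor created above (sketch only):

    static void my_sampler_set_input_ggml(struct llama_sampler * smpl,
                                          ggml_context         * ctx,
                                          ggml_cgraph          * gf) {
        auto * sctx = (my_sampler_ctx *) smpl->ctx;

        // prepare the input on the host ...
        std::vector<float> bias(ggml_nelements(sctx->inp_bias), 0.0f);

        // ... and copy it into the device tensor created in init_ggml
        ggml_backend_tensor_set(sctx->inp_bias, bias.data(), 0, ggml_nbytes(sctx->inp_bias));

        GGML_UNUSED(ctx);
        GGML_UNUSED(gf);
    }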

The apply_ggml function is where a GPU sampler adds its operations to the graph: when the graph is built, each configured sampler's apply_ggml function is called, allowing it to add operations/nodes to the computation graph.
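
For example, a temperature-style sampler's apply_ggml might simply scale the logits in-graph. Note that the logits member of llama_sampler_ggml_data is an assumption about the struct's layout made for this sketch; the actual field names in the PR may differ:

    static void my_sampler_apply_ggml(struct llama_sampler    * smpl,
                                      ggml_context            * ctx,
                                      ggml_cgraph             * gf,
                                      llama_sampler_ggml_data * ggml_data) {
        const float temp = 0.8f; // would normally come from the sampler state

        // scale the logits by 1/temp and make the result the new logits tensor
        struct ggml_tensor * cur = ggml_scale(ctx, ggml_data->logits, 1.0f / temp);
        ggml_build_forward_expand(gf, cur);
        ggml_data->logits = cur;

        GGML_UNUSED(smpl);
    }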

The accept_ggml function allows GPU samplers to update their tensor state if needed.

This enables sampling to happen fully or partially on the GPU. The samplers may select a single token, in which case only that token is transferred from device memory to host memory after llama_decode has been called. The sampled token can then be retrieved using:

    llama_token id = llama_get_sampled_token_ith(test_ctx.ctx, index);
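
As a usage sketch (the negative index mirrors how llama_get_logits_ith is commonly used for the last output; whether negative indices are accepted here is an assumption):

    // decode a batch and read back the token sampled on the device for the last position
    llama_batch batch = llama_batch_get_one(tokens.data(), (int32_t) tokens.size());
    if (llama_decode(ctx, batch) == 0) {
        llama_token id = llama_get_sampled_token_ith(ctx, -1);
        // feed `id` back in as the next input token ...
    }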

It is also possible to run a GPU sampler that only filters the logits; then only the filtered logits are transferred back to the host, and sampling proceeds on the CPU with the normal (CPU) sampler chain. In this case the CPU samplers are configured as usual, but they now operate on already-filtered logits.

Similar to the above handling of logits, it is possible for a GPU sampler to compute the full probability distribution and transfer that to the host; the CPU samplers can then operate on those probabilities.
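
A sketch of such a hybrid setup, assuming the GPU chain (configured via cparams.samplers as above) only filters the logits and a regular CPU chain finishes the sampling:

    // CPU chain configured as usual; after llama_decode it operates on the
    // already-filtered logits produced by the GPU samplers (per the description above)
    struct llama_sampler * cpu_chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(cpu_chain, llama_sampler_init_temp(0.8f));
    llama_sampler_chain_add(cpu_chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

    llama_token id = llama_sampler_sample(cpu_chain, ctx, -1);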

Building and running the tests

Download a model for testing:

$ cd models && wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf

Building the test:

$ cmake --build build --target test-gpu-sampling -j8

Running all tests:

$ env LLAMACPP_TEST_MODELFILE=../models/stories15M-q4_0.gguf \
    ctest --test-dir build -R '^test-gpu-sampling$' -V

The following individual tests are available:

$ ctest --test-dir build -N -R test-gpu-sampling-
  Test 35: test-gpu-sampling-greedy
  Test 36: test-gpu-sampling-temp
  Test 38: test-gpu-sampling-top_k
  Test 40: test-gpu-sampling-mul_seq

Total Tests: 6

These can be run individually, for example:

$ env LLAMACPP_TEST_MODELFILE=../models/stories15M-q4_0.gguf \
    ctest --test-dir build -R 'test-gpu-sampling-temp' -V

llama-cli

Initial support for llama-cli has been added and can be used as follows:

$ export GGML_SCHED_DEBUG=2
$ ./build/bin/llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
    -p "What is the Capital of Sweden?" \
    --gpu-sampling \
    --gpu-dist \
    -ngl 99 \
    -no-cnv \
    -n 20 \
    --no-warmup

(To print the backend scheduler assignments, add -v/--verbose to the above command in combination with GGML_SCHED_DEBUG)

llama-server

GPU sampling can be enabled using the following global configuration command line options:

$ ./build-gpu-sampling/bin/llama-server --help
...
----- sampling params -----
...
--gpu-sampling                          enable GPU sampling (default: disabled)
--gpu-dist                              perform final sampling on GPU (default: disabled)

Usage:

$ export GGML_SCHED_DEBUG=2
$ ./build/bin/llama-server \
      -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
      --gpu-sampling \
      --temp 0.8 \
      --top-k 40 \
      -ngl 50

(To print the backend scheduler assignments, add -v/--verbose to the above command in combination with GGML_SCHED_DEBUG)

It is then possible to send GPU sampling parameters per request as follows:

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "What is the capital of Sweden?","n_predict": 20, "top_k": 40, "gpu_dist": true}'

The gpu_dist option will cause the dist GPU sampler to sample a token. Without this setting, the CPU samplers will process the filtered logits that the GPU samplers produced; this currently needs more work on the CPU sampler side.

To enable testing with the webui, GPU sampling settings have been added to the UI (see the gpu-settings screenshot).

TODO

  • [x] Allocate GPU sampler tensors on the same backend as the logits (dev_output.dev)
  • [x] Allow GPU samplers to pre-allocate state tensors
  • [x] Integrate GPU samplers with llama-cli
  • [x] Set/unset GPU samplers
  • [x] Integrate GPU samplers with llama-server
  • [x] Add more tests/assertions for the gpu samplers to check more cases
  • [ ] penalties samplers (to figure out/verify how accept_ggml should work)
  • [ ] Add support for operations like ggml_top_k (support vocabulary size tensors) in all backends
  • [ ] Add ggml_cumsum operation to all backends:
    • [x] CPU implementation exists.
    • [ ] CUDA implementation of ggml_cumsum (partially implemented by me, but needs someone with more CUDA knowledge (so pretty much anyone) to implement it properly).
  • [ ] Consistent and clearer naming of GPU (device sampling) functions and data types.

Implemented GPU samplers

  • [x] temp
  • [x] logit_bias
  • [x] top_k (Not fully supported on all backends, see note below regarding argsort)
  • [x] greedy
  • [x] dist sampler

Remaining GPU samplers

The list below shows the current CPU samplers that exist. Not all of these may be appropriate as GPU samplers.

  • [ ] top_p
  • [ ] min_p
  • [ ] typical
  • [ ] temp_ext
  • [ ] xtc
  • [ ] top_n_sigma
  • [ ] mirostat/mirostat_v2
  • [ ] penalties
  • [ ] dry
  • [ ] infill

I think we should have support in all backends for the operations that the GPU samplers use. At the moment this is not the case: if the target backend device (the same device that holds the logits tensor) does not support an operation, a warning similar to the following is printed:

Warning: backend does not support argsort operation required for top-k sampling
CPU backend will be used instead which defeats the purpose of having GPU samplers

📝 Note: ARGSORT is not supported for arbitrary column width on Metal at the moment

       case GGML_OP_ARGSORT:
           // TODO: Support arbitrary column width
           return op->src[0]->ne[0] <= 1024;

So on macOS, samplers that use ARGSORT currently don't work. For GPU samplers the dimension can be as large as the model's vocabulary size, for example:

(lldb) p op->src[0]->ne[0]
(int64_t) 32000

danbev avatar Nov 04 '25 17:11 danbev

One place this would be useful immediately is the diffusion-cli. I'm happy to test this when it's ready

am17an avatar Nov 05 '25 09:11 am17an

Not sure if I have a strong opinion on this, but removing hybrid sampling would reduce the complexity a bit I think (basically, if we always set --gpu-dist we only have two states: either full GPU sampling or full CPU sampling, with no in-between).

My thinking is that we should keep the hybrid approach even though it does come with some additional complexity, like you say. I think there could be use cases where one might want to perform some sampling like temp/logit_bias/top-k on the device, have only a smaller set of logits copied to host memory, and still enable other CPU samplers, including grammars, to process those logits.

This might turn out to be an incorrect assumption and not something anyone wants to use, but it feels safer to keep the ability to do hybrid sampling.

danbev avatar Nov 13 '25 06:11 danbev

@danbev Let's rebase on latest master to pick up the recent changes.

ggerganov avatar Nov 14 '25 07:11 ggerganov

Would it be alright to limit the scope of this PR to only the following backend samplers:

  • logit bias
  • temperature
  • top_k
  • greedy
  • distribution

And then we can add more samplers in follow up PRs?

If this sounds alright, I'll remove the commits that we don't want to include at this point, and squash others to make it easier to review. And then make this a non-draft PR for some more reviews.

danbev avatar Nov 17 '25 11:11 danbev

I think it is OK to implement the "accept" functionality (together with the penalties) in a separate PR since it brings extra complexity.

ggerganov avatar Nov 17 '25 11:11 ggerganov

Based on some llama-cli-based benchmarking I did in 26be108, I feel the timings reported by llama_perf_context_print may be off.

For optimized argsort, we get

llama_perf_sampler_print:    sampling time =       0.25 ms /   207 runs   (    0.00 ms per token, 828000.00 tokens per second)
llama_perf_context_print:        load time =   18366.92 ms
llama_perf_context_print: prompt eval time =      35.92 ms /     7 tokens (    5.13 ms per token,   194.87 tokens per second)
llama_perf_context_print:        eval time =     532.79 ms /   199 runs   (    2.68 ms per token,   373.50 tokens per second)
llama_perf_context_print:       total time =     683.65 ms /   206 tokens
llama_perf_context_print:    graphs reused =        198

For non-optimized argsort

llama_perf_sampler_print:    sampling time =       0.25 ms /   207 runs   (    0.00 ms per token, 824701.20 tokens per second)
llama_perf_context_print:        load time =   18215.58 ms
llama_perf_context_print: prompt eval time =      28.20 ms /     7 tokens (    4.03 ms per token,   248.19 tokens per second)
llama_perf_context_print:        eval time =     714.79 ms /   199 runs   (    3.59 ms per token,   278.40 tokens per second)
llama_perf_context_print:       total time =     857.62 ms /   206 tokens
llama_perf_context_print:    graphs reused =        198

and for CPU-sampling

llama_perf_sampler_print:    sampling time =      19.57 ms /   207 runs   (    0.09 ms per token, 10579.58 tokens per second)
llama_perf_context_print:        load time =   18254.54 ms
llama_perf_context_print: prompt eval time =      23.96 ms /     7 tokens (    3.42 ms per token,   292.10 tokens per second)
llama_perf_context_print:        eval time =     529.06 ms /   199 runs   (    2.66 ms per token,   376.14 tokens per second)
llama_perf_context_print:       total time =     914.23 ms /   206 tokens
llama_perf_context_print:    graphs reused =        198

Basically total time is behaving as expected, but I'd have thought sampling time + prompt eval time + eval time to come somewhat close to it. This gap is especially large for CPU-based sampling

ORippler avatar Nov 18 '25 18:11 ORippler

@danbev 7e98ebc might have introduced a bug - I'm getting gibberish with backend sampling disabled.

I'd have thought sampling time + prompt eval time + eval time to come somewhat close to it.

@ORippler They should. Is the CPU-sampling gap so large even on master?

ggerganov avatar Nov 19 '25 09:11 ggerganov

@danbev 7e98ebc might have introduced a bug - I'm getting gibberish with backend sampling disabled.

Sorry about that, I'll look into it.

It should be producing normal output now, but I think I found another bug. Sometimes llama-cli will output [end of text] directly without sampling anything, and this can happen both with and without the backend sampler enabled. I'm looking into this now. Update: this also happens on master, so it might not be directly related to this PR.

danbev avatar Nov 19 '25 09:11 danbev

@ORippler They should. Is the CPU-sampling gap so large even on master?

Order below is total, eval, prompt eval, sampling. p=7, n=200 on 26be108

>>> 914 - 529 - 24 - 19
342 (37%)

p=7,n=1000 on 26be108

>>> 3991 - 2631 - 23 - 92
1245 (31%)

p=7, n=200 on 6fd4f9536

>>> 713 - 527 - 24 - 18
144 (20%)

p=7, n=1000 on 6fd4f9536

>>> 3039 - 2640 - 23.6 - 94
281.4 (9%)

Timings are consistent across llama-cli invocations. Feels like we are missing something on both master and this PR (though for this PR it scales linearly).

ORippler avatar Nov 19 '25 11:11 ORippler

Anything speaking against reserving appropriate buffers (and reusing n_output in case of --backend_dist) in llama_context::output_reserve when --backend_sampling and --backend_dist are specified?

I've tried to address this in https://github.com/ggml-org/llama.cpp/pull/17004/commits/61ffe41dc1cdafb2c71b650c4265c22ec769f88b. I still see some paged memory usage in nsys, but as far as I can tell these come from llama_model_loader::load_all_data. I'll go through this again on Monday and try to verify/test further.

danbev avatar Nov 21 '25 15:11 danbev

With the latest changes, samplers are allowed to run in the backend if:

  • Either they implement backend sampling with the backend_apply API
  • Or the provided parameters result in an empty sampler (i.e. noop)

For example running llama-cli with default sampling settings will result in a full-backend chain:

sampler chain: *logits -> *penalties? -> *dry? -> *top-n-sigma? -> *top-k -> *typical? -> *top-p -> *min-p -> *xtc? -> *temp-ext -> *dist 
  • The * symbol indicates that the sampler runs on the backend
  • The ? symbol indicates that the sampler is "empty" (i.e. noop)

These changes should make the usage of the backend sampling functionality more seamless to the user and allow gradual support for backend samplers to be introduced.

ggerganov avatar Dec 01 '25 16:12 ggerganov