llama.cpp
Maybe it would be better to have a diagram showing how llama.cpp processes inference
I'm using llama.cpp to deploy the deepseek-r1-671B-Q4_0 weights, but I found the documentation/README.md barely detailed; I even had to read the source to understand what happens when I turn some flags on. For example '--gpu-layers': according to the code it is a key condition for PP (pipeline parallelism), but the documentation says nothing about this detail, and I saw no better performance when I set it greater than the number of model tensor layers.
```cpp
// TODO: move these checks to ggml_backend_sched
// enabling pipeline parallelism in the scheduler increases memory usage, so it is only done when necessary
bool pipeline_parallel =
    model->n_devices() > 1 &&
    model->params.n_gpu_layers > (int)model->hparams.n_layer &&
    model->params.split_mode == LLAMA_SPLIT_MODE_LAYER &&
    params.offload_kqv;
```
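As far as I can tell, the strict '>' in that check is why --gpu-layers has to exceed the model's layer count before pipeline parallelism is enabled at all. Below is a minimal, self-contained C++ sketch of just that boolean condition with made-up values; the `Example` struct, its field names, and the numbers are my own assumptions for illustration, not llama.cpp's actual types:

```cpp
#include <cstdio>

// Hypothetical stand-ins for the fields the quoted check reads
// (assumption: in llama.cpp these come from model/context params).
struct Example {
    int  n_devices;    // number of GPU backends
    int  n_gpu_layers; // value of --gpu-layers / -ngl
    int  n_layer;      // number of transformer layers in the model
    bool split_layer;  // split_mode == LLAMA_SPLIT_MODE_LAYER
    bool offload_kqv;  // KV cache offloading enabled
};

static bool pipeline_parallel(const Example & e) {
    // Mirrors the quoted condition: note the strict '>' on the
    // layer count comparison.
    return e.n_devices > 1 &&
           e.n_gpu_layers > e.n_layer &&
           e.split_layer &&
           e.offload_kqv;
}

int main() {
    Example a{2, 61, 61, true, true}; // -ngl equal to n_layer: PP stays off
    Example b{2, 62, 61, true, true}; // -ngl above n_layer:    PP turns on
    std::printf("ngl == n_layer -> PP %s\n", pipeline_parallel(a) ? "on" : "off");
    std::printf("ngl >  n_layer -> PP %s\n", pipeline_parallel(b) ? "on" : "off");
    return 0;
}
```

Even with -ngl above n_layer (so, if I read this right, pipeline_parallel should evaluate to true given multiple devices, layer split mode, and KV offload), I saw no throughput improvement, which is exactly why a diagram of what the scheduler actually does would help.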
It would be highly appreciated if there were a processing diagram, ideally with the related flags attached to each node.
Thanks in advance.