llama.cpp
Maybe it would be better to have a diagram showing how llama.cpp processes inference
I'm using llama.cpp to deploy the deepseek-r1-671B-Q4_0 weights, but I found the documentation/README.md barely detailed; I even had to read the source to understand what happens when I turn some flags on. For example '--gpu-layers': according to the code it is a key condition for PP (pipeline parallelism), but the documentation says nothing about this detail, and I saw no better performance when I set it greater than the number of model tensor layers.
```cpp
// TODO: move these checks to ggml_backend_sched
// enabling pipeline parallelism in the scheduler increases memory usage, so it is only done when necessary
bool pipeline_parallel =
    model->n_devices() > 1 &&
    model->params.n_gpu_layers > (int)model->hparams.n_layer &&
    model->params.split_mode == LLAMA_SPLIT_MODE_LAYER &&
    params.offload_kqv;
```
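As far as I can tell, the strict '>' in that check is why --gpu-layers has to exceed the model's layer count before pipeline parallelism is enabled at all. Below is a minimal, self-contained C++ sketch of just that boolean condition with made-up values; the `Example` struct, its field names, and the numbers are my own assumptions for illustration, not llama.cpp's actual types:

```cpp
#include <cstdio>

// Hypothetical stand-ins for the fields the quoted check reads
// (assumption: in llama.cpp these come from model/context params).
struct Example {
    int  n_devices;    // number of GPU backends
    int  n_gpu_layers; // value of --gpu-layers / -ngl
    int  n_layer;      // number of transformer layers in the model
    bool split_layer;  // split_mode == LLAMA_SPLIT_MODE_LAYER
    bool offload_kqv;  // KV cache offloading enabled
};

static bool pipeline_parallel(const Example & e) {
    // Mirrors the quoted condition: note the strict '>' on the
    // layer count comparison.
    return e.n_devices > 1 &&
           e.n_gpu_layers > e.n_layer &&
           e.split_layer &&
           e.offload_kqv;
}

int main() {
    Example a{2, 61, 61, true, true}; // -ngl equal to n_layer: PP stays off
    Example b{2, 62, 61, true, true}; // -ngl above n_layer:    PP turns on
    std::printf("ngl == n_layer -> PP %s\n", pipeline_parallel(a) ? "on" : "off");
    std::printf("ngl >  n_layer -> PP %s\n", pipeline_parallel(b) ? "on" : "off");
    return 0;
}
```

Even with -ngl above n_layer (so, if I read this right, pipeline_parallel should evaluate to true given multiple devices, layer split mode, and KV offload), I saw no throughput improvement, which is exactly why a diagram of what the scheduler actually does would help.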
It would be highly appreciated if there were a processing diagram, ideally with the related flags attached to each node.
Thanks in advance.