Zoli Somogyi
[bmazzarol](https://github.com/bmazzarol), let's assume your benchmark is accurate and that EmbedBatch1 really takes 22 ms to create. Of that time, the majority is likely spent on generating the embedding itself, so we can...
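For reference, a minimal sketch (not the benchmark itself) of how to split model-load time from context-setup time with the llama.cpp C API; exact signatures drift between versions, and the model path is a placeholder:

```cpp
// Rough timing split: model load vs. context creation. Tokenization and
// the llama_decode call that produces the embedding would be timed the
// same way. Signatures follow older llama.cpp builds; adjust to yours.
#include "llama.h"
#include <chrono>
#include <cstdio>

int main() {
    using clock = std::chrono::steady_clock;

    llama_backend_init();

    auto t0 = clock::now();
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    auto t1 = clock::now();

    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings = true; // named `embedding` in some older builds
    llama_context * ctx = llama_new_context_with_model(model, cparams);
    auto t2 = clock::now();

    // ... tokenize the input, call llama_decode, and time that span too ...

    auto ms = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    printf("load: %lld ms, context init: %lld ms\n",
           (long long)ms(t0, t1), (long long)ms(t1, t2));

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```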
I would expect the model to keep its KV cache GPU memory space and simply reset it, without needing to reallocate it. The model should not need...
Everything is kept after loading the model, but the KV cache is allocated anew every time I run inference. This is the problem. You can test it by using 2 models which...
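The expected pattern would look like this sketch: one long-lived context whose KV cache buffers stay allocated, with only the contents cleared between inferences. Note that `llama_kv_cache_clear` is the historical name; newer llama.cpp builds rename it (e.g. `llama_kv_self_clear`), so adjust to your version:

```cpp
// Keep one context (and its KV cache buffers) alive; only clear the
// cache contents between inferences instead of reallocating.
#include "llama.h"

void run_many(llama_context * ctx, llama_batch batch, int n_runs) {
    for (int i = 0; i < n_runs; ++i) {
        llama_kv_cache_clear(ctx);   // reset contents; should not reallocate
        llama_decode(ctx, batch);    // reuse the same pre-allocated buffers
    }
}
```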
I have created a test program and was able to narrow the problem down further. The crash occurs when you load model A but do not use it immediately, but...
No, you forgot to mention the last step, where GPU memory is allocated once more when the model is used for the first time. This is not as expected. What...
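A sketch of the sequence described above (paths are placeholders, and the API names track older llama.cpp builds): load model A, do not touch it, load model B, then use A for the first time. The first `llama_decode` on A is where the additional GPU allocation happens, which can fail if B has claimed that memory in the meantime:

```cpp
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mp = llama_model_default_params();

    llama_model   * model_a = llama_load_model_from_file("model_a.gguf", mp);
    llama_context * ctx_a   = llama_new_context_with_model(model_a, llama_context_default_params());

    llama_model   * model_b = llama_load_model_from_file("model_b.gguf", mp);
    llama_context * ctx_b   = llama_new_context_with_model(model_b, llama_context_default_params());

    llama_token tok = llama_token_bos(model_a);       // any valid token works for the repro
    llama_batch batch = llama_batch_get_one(&tok, 1); // older builds also take pos/seq_id

    llama_decode(ctx_a, batch); // first use of A: allocates more GPU memory here

    llama_free(ctx_a);  llama_free_model(model_a);
    llama_free(ctx_b);  llama_free_model(model_b);
    llama_backend_free();
    return 0;
}
```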
ggml-cuda.cu crashes after this:

```
llama.dll!llama_graph_compute(llama_context & lctx, ggml_cgraph * gf, int n_threads) Line 11094  C++
llama.dll!llama_decode_internal(llama_context & lctx, llama_batch batch_all) Line 11336  C++
llama.dll!llama_decode(llama_context * ctx, llama_batch batch) Line...
```
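One thing worth doing while debugging this: `llama_decode` returns a status code, so checking it separates API-level failures (e.g. a full KV cache) from hard aborts inside the CUDA backend like the one in the stack above. A crash inside ggml-cuda.cu will of course never return here, but a clean non-zero code rules that path out:

```cpp
#include "llama.h"
#include <cstdio>

bool decode_checked(llama_context * ctx, llama_batch batch) {
    const int rc = llama_decode(ctx, batch); // 0 = success, non-zero = failure
    if (rc != 0) {
        fprintf(stderr, "llama_decode failed with code %d\n", rc);
        return false;
    }
    return true;
}
```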
GPU memory handling is a sensitive issue in llama.cpp; I have not gotten an answer to my question of why my models use 20% more GPU memory today compared to...
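To put numbers on observations like that "20% more", a simple approach is to sample free VRAM via the CUDA runtime before and after a model load and compare the delta across library versions; a minimal sketch:

```cpp
// cudaMemGetInfo reports free/total device memory for the current GPU,
// so the difference across a model load gives the allocation size.
#include <cuda_runtime.h>
#include <cstdio>

size_t free_vram_bytes() {
    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);
    return free_b;
}

// Usage: size_t before = free_vram_bytes(); /* load model here */
// printf("model uses %.1f MiB\n",
//        (before - free_vram_bytes()) / (1024.0 * 1024.0));
```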
I have investigated the issue further, and the library crashes even when using only one model that does not fully fit into GPU memory. The problem is the additional...
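When a model does not fully fit, one workaround is to offload fewer layers so that VRAM headroom remains for the additional buffers (KV cache, compute buffers) allocated at context creation and first use. A hedged sketch; the layer count and path are placeholders:

```cpp
#include "llama.h"

llama_model * load_partially_offloaded(const char * path) {
    llama_model_params mp = llama_model_default_params();
    mp.n_gpu_layers = 20; // keep remaining layers on the CPU, freeing VRAM
    return llama_load_model_from_file(path, mp);
}
```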
This is exactly one of the reasons why you should compile the code yourself instead of using pre-compiled packages. Even if you find the missing DLLs now, the problem could...
Please check out the comments in https://github.com/SciSharp/LLamaSharp/issues/1259. This is not a bug but efficient context handling, and we really need it like this as standard behavior. You will need a...
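The context-handling pattern this refers to presumably looks like the following sketch (an assumption based on the comment, not the library's actual code): the weights are loaded once and shared, while each conversation gets its own context that is created and freed on demand, so no state leaks between conversations:

```cpp
#include "llama.h"

struct Engine {
    llama_model * model; // loaded once, shared by all conversations
};

llama_context * begin_conversation(Engine & e) {
    return llama_new_context_with_model(e.model, llama_context_default_params());
}

void end_conversation(llama_context * ctx) {
    llama_free(ctx); // releases this conversation's KV cache
}
```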