Diego Devesa comments

Results: 361 comments of Diego Devesa

perf(CuBLAS): explore reduction in launch overhead via CUDA graphs

I don't think that we launch enough kernels for this to make a meaningful difference.

New optimization from NVIDIA to use CUDA Graphs in llama.cpp

Hi @agray3, you can fork the project on GitHub and push the branch to your fork. Then you will have the option to open a PR from the changes in...

Investigate PagedAttention KV-cache memory management for faster inference

We allocate all the KV memory required for the maximum context length on startup in one block, so we shouldn't have any fragmentation either.
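As a rough illustration of the single up-front allocation described above, here is a minimal sketch in C. It is not llama.cpp's actual code: the struct `kv_cache_sketch`, the helper names, and the layout (all K entries followed by all V entries, one fixed-size slot per layer and position) are assumptions made for this example only.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical cache descriptor: one contiguous block sized for the
   maximum context length, so slots never need to be reallocated. */
typedef struct {
    unsigned char *data;      /* single allocation holding K then V */
    size_t         size;      /* total bytes */
    size_t         n_layer;
    size_t         n_ctx;     /* maximum context length */
    size_t         row_bytes; /* bytes per (layer, position) entry */
} kv_cache_sketch;

/* Total bytes: 2 tensors (K and V) * layers * positions * embedding * element size. */
static size_t kv_cache_bytes(size_t n_layer, size_t n_ctx, size_t n_embd, size_t elem_size) {
    return 2 * n_layer * n_ctx * n_embd * elem_size;
}

/* Allocate the whole cache once at startup. Returns 1 on success, 0 on failure. */
static int kv_cache_init(kv_cache_sketch *c, size_t n_layer, size_t n_ctx,
                         size_t n_embd, size_t elem_size) {
    c->n_layer   = n_layer;
    c->n_ctx     = n_ctx;
    c->row_bytes = n_embd * elem_size;
    c->size      = kv_cache_bytes(n_layer, n_ctx, n_embd, elem_size);
    c->data      = malloc(c->size);
    return c->data != NULL;
}

/* Byte offset of the K entry for (layer, pos); a fixed function of the
   indices, so no per-sequence allocation ever happens. */
static size_t kv_cache_k_offset(const kv_cache_sketch *c, size_t layer, size_t pos) {
    return (layer * c->n_ctx + pos) * c->row_bytes;
}

/* The V entries start right after all K entries in this assumed layout. */
static size_t kv_cache_v_offset(const kv_cache_sketch *c, size_t layer, size_t pos) {
    return c->size / 2 + kv_cache_k_offset(c, layer, pos);
}

static void kv_cache_free(kv_cache_sketch *c) {
    free(c->data);
    c->data = NULL;
}
```

Because every slot's offset is computed from the indices, the block is never carved up at runtime. Under these assumptions, that is why fragmentation does not arise.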

fix(avx): workaround for missing _mm256_set_m128i in GCC < 8

I think the ifdefs are unnecessary because both compile to the same instructions ([see this in godbolt](https://godbolt.org/z/nvhv1bT1v)). You could simply use the `_mm256_insertf128_si256` version everywhere.
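A minimal sketch of what using the `_mm256_insertf128_si256` version everywhere could look like. The helper names `combine_m128i` and `combine_demo` are made up for this example and are not taken from the PR. The `target("avx")` attribute is a GCC/Clang extension used here so the snippet builds without `-mavx`; it assumes an x86-64 CPU with AVX.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: build a 256-bit vector from two 128-bit halves.
   Same result as _mm256_set_m128i(hi, lo), but uses only intrinsics
   that are also available in GCC versions older than 8. */
__attribute__((target("avx")))
static inline __m256i combine_m128i(__m128i hi, __m128i lo) {
    /* Place lo in the lower 128 bits, then insert hi into the upper 128 bits. */
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}

/* Writes the eight 32-bit lanes of the combined vector to out[0..7]. */
__attribute__((target("avx")))
static void combine_demo(int32_t out[8]) {
    __m128i lo = _mm_set_epi32(3, 2, 1, 0); /* lanes 0..3 */
    __m128i hi = _mm_set_epi32(7, 6, 5, 4); /* lanes 4..7 */
    _mm256_storeu_si256((__m256i *)out, combine_m128i(hi, lo));
}
```

With a single helper like this, the version check and the `#ifdef` branches are no longer needed.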

wrong number of tensors for AdaptLLM/medicine-chat

@cotwitch If you look at that model with `gguf-dump.py`, you will see that it has the tensor `output.weight` duplicated. Not sure how that happened, but that's not a valid model.

wrong number of tensors for AdaptLLM/medicine-chat

Sorry, I do not have any insights about how that may have happened. I guess it is a bug in the conversion script, and the gguf-py library should have prevented...

AVX2 optimization for vec_dot_q4_3_q8_0 and refactoring

```
q4_3
42.94 seconds per pass - ETA 7.81 hours
prompt eval time = 54411.09 ms / 631 tokens ( 86.23 ms per token)

bs=512
prompt eval time = 59126.51...
```

First impressions info dump

Adding my first impressions here as well. I had some compile errors in my system:
```
stable-diffusion.cpp/stable-diffusion.cpp: In function ‘void copy_ggml_tensor(ggml_tensor*, const ggml_tensor*)’:
stable-diffusion.cpp/stable-diffusion.cpp:171:5: error: ‘memcpy’ was not declared in...
```

First impressions info dump

I am using `gcc (Ubuntu 12.3.0-1ubuntu1~23.04) 12.3.0`, which should be the current version of GCC in Ubuntu-latest.

Correlation between cpu threads and n-gpu-layers

The behavior is different depending on the GPU backend being used. Since it is a Mali GPU, I assume that you are using OpenCL, is that correct?