Andrei
@akarasulu I believe this was fixed by @iamlemec and should be live in the recent 0.3.15 release
Hey @lsorber, thank you for reporting this. A temporary workaround for now is to set `n_ubatch` as well as `n_batch`, i.e.:

```python
from llama_cpp import LLAMA_POOLING_TYPE_NONE, Llama

embedder = Llama.from_pretrained(...
```
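The snippet above is cut off, but the workaround it describes can be sketched roughly as follows. The repo id, filename, and sizes below are placeholders, not the reporter's actual values; the only point is that `n_ubatch` is passed explicitly alongside `n_batch` (note this requires a model download at runtime, so treat it as illustrative):

```python
from llama_cpp import LLAMA_POOLING_TYPE_NONE, Llama

# Workaround sketch: pass n_ubatch alongside n_batch so the batch
# size used for embedding is applied consistently.
embedder = Llama.from_pretrained(
    repo_id="someuser/some-embedding-model-GGUF",  # placeholder repo
    filename="*Q4_K_M.gguf",                       # placeholder filename
    embedding=True,
    pooling_type=LLAMA_POOLING_TYPE_NONE,
    n_ctx=4096,
    n_batch=4096,
    n_ubatch=4096,  # set to the same value as n_batch
)
```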
@lsorber yes it is, I'll try to get that wheel issue resolved as soon as possible!
Should be fixed now
@ayttop maybe someone else knows better, but for integrated graphics, compiling for the Vulkan backend may be your only option, though it may not be faster than a CPU installation.
@thewh1teagle can you try re-building with `--verbose` to get an idea of what's being compiled? Additionally, when building `llama.cpp`, can you post your full logs and time to build (from...
Linking #1714 as it seems to be the same issue. The cause appears to be aggressive link-time optimization by MSVC.
> Explicitly setting the mask through the API would be possible, but I think it would be too difficult to use. I'm partial to this if it's the most straightforward...
@ggerganov I'll see what I can come up with along those lines. We could probably limit complexity by only allowing future attention to work within a batch, otherwise we would...
@iamlemec @ggerganov sounds good. So if I understand correctly, the approach would be to update the path for `causal_attn == false` in `decode_internal` to also populate the KV cache and...