Andrei
@akarasulu I believe this was fixed by @iamlemec and should be live in the recent 0.3.15 release
Hey @lsorber, thank you for reporting this. A temporary workaround for now is to set `n_ubatch` as well as `n_batch`, i.e.:

```python
from llama_cpp import LLAMA_POOLING_TYPE_NONE, Llama

embedder = Llama.from_pretrained(...
```
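The snippet above is cut off, but the workaround it describes can be sketched roughly as follows. The repo id, filename, and sizes below are placeholders, not the reporter's actual values; the only point is that `n_ubatch` is passed explicitly alongside `n_batch` (note this requires a model download at runtime, so treat it as illustrative):

```python
from llama_cpp import LLAMA_POOLING_TYPE_NONE, Llama

# Workaround sketch: pass n_ubatch alongside n_batch so the batch
# size used for embedding is applied consistently.
embedder = Llama.from_pretrained(
    repo_id="someuser/some-embedding-model-GGUF",  # placeholder repo
    filename="*Q4_K_M.gguf",                       # placeholder filename
    embedding=True,
    pooling_type=LLAMA_POOLING_TYPE_NONE,
    n_ctx=4096,
    n_batch=4096,
    n_ubatch=4096,  # set to the same value as n_batch
)
```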
@lsorber yes it is, I'll try to get that wheel issue resolved as soon as possible!
Should be fixed now
@ayttop maybe someone else knows better, but for integrated graphics, compiling for the Vulkan backend may be your only option, though it may not be faster than a CPU installation.
@thewh1teagle can you try re-building with `--verbose` to get an idea of what's being compiled? Additionally, when building `llama.cpp`, can you post your full logs and time to build (from...
Linking #1714 as it seems to be the same issue. The cause appears to be aggressive link-time optimization by MSVC.
> Explicitly setting the mask through the API would be possible, but I think it would be too difficult to use. I'm partial to this if it's the most straightforward...
@ggerganov I'll see what I can come up with along those lines. We could probably limit complexity by only allowing future attention to work within a batch, otherwise we would...
@iamlemec @ggerganov sounds good. So if I understand correctly, the approach would be to update the path for `causal_attn == false` in `decode_internal` to also populate the KV cache and...