compilade

Results: 108 comments by compilade

> I don't have the system resources required to run Qwen 72B myself, but if anyone knows of another model that uses quantized fallbacks, I'll be happy to test that!...

> for the same reason we avoid taking K-quanted tensors and mixing them into a q5_0 or q4_1 model. Note that `q6_K` is still used for `output.weight` when making a...

> Would something like this work? Yes, using `fmaf` should work. But I would prefer to avoid using FMA in the reference implementation, because in NumPy this has to be...
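To make the rounding difference concrete, here's a minimal standalone C++ sketch (not from the original thread; the values are purely illustrative) contrasting `fmaf` with a separate multiply-add:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const float a = 1.0f / 3.0f; // nearest float to 1/3, slightly above it
    const float b = 3.0f;
    const float c = -1.0f;

    // Two roundings: a*b first rounds to exactly 1.0f, then adding c gives 0.
    const float separate = a * b + c;

    // One rounding: the product is kept exact inside the fused operation.
    const float fused = std::fmaf(a, b, c);

    std::printf("separate: %g\n", separate); // 0
    std::printf("fused:    %g\n", fused);    // ~2.98023e-08
    return 0;
}
```

Compiling with `-ffp-contract=off` (GCC/Clang) keeps the compiler from contracting `a * b + c` into an FMA on its own, which is the same contraction issue a reference implementation has to guard against.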

Related to the FMA rounding of `Q4_0` and `Q5_0`, it seems that ***all k-quants*** (and I guess also i-quants) have platform-dependent rounding because `nearest_int` is marked `inline`, and so it...
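For context, here is a paraphrased sketch of the `nearest_int` magic-number trick (adapted from ggml-quants.c; the inline/FMA interaction noted in the comments is my reading of why the rounding becomes platform-dependent):

```cpp
#include <cassert>
#include <cmath>
#include <cstring>

// Paraphrase of the magic-number rounding trick: adding 1.5 * 2^23 pushes
// the fractional bits out of a float's mantissa, so the low mantissa bits
// of the result hold the rounded integer (offset by 0x00400000).
static inline int nearest_int(float fval) {
    assert(std::fabs(fval) <= 4194303.f); // must fit in the mantissa
    float val = fval + 12582912.f;        // 12582912 = 1.5 * 2^23
    int i;
    std::memcpy(&i, &val, sizeof(int));   // reinterpret the float's bits
    return (i & 0x007fffff) - 0x00400000;
}

// Because the function is inline, a caller like
//     nearest_int(iscale * x[j])
// lets the compiler contract `iscale * x[j] + 12582912.f` into a single
// FMA where available -- and the rounding of that first addition is exactly
// what the trick depends on, hence platform-dependent results.
```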

@CISC I went ahead and changed `gguf-py/pyproject.toml` as you suggest and removed `gguf-py/gguf/scripts/__init__.py` because it's not really needed since implicit namespace packages were added in Python 3.3. I've tested the scripts...

> Any reason for not merging yet? Not really, sorry (I got distracted). > If nix/flake changes are a concern they can be left out for now, ref I've been...

> Is this enough to confirm the SWA functionality? I think so. It might also be relevant to test SWA with parallel sequences (I *think* this is what using a...

> Guys, is there any progress in supporting Mamba2 (I'm interested in the new mamba-codestral)? Still waiting on some upstream changes (see ), but otherwise I'm beginning to investigate the...

I'll be re-running a few tests before merging this, hopefully in less than 2 days. There are now both [Mamba-2](https://github.com/ggerganov/llama.cpp/issues/8519#issuecomment-2295593409) and [RWKV v6](https://github.com/ggerganov/llama.cpp/pull/8980), which kind of need this to simplify...

I've run some tests, and there's a problem: pooled embeddings with Mamba can't work with multiple sequences anymore. This is because `lctx.embd_seq` is overwritten at each `ubatch`, which makes it...
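A minimal sketch of the failure mode, assuming `embd_seq` is a map from sequence id to pooled embedding (hypothetical names and structure, not llama.cpp's actual code):

```cpp
#include <cstdint>
#include <map>
#include <vector>

using llama_seq_id = int32_t;

// Hypothetical reduction of the issue: a per-context map of pooled
// embeddings keyed by sequence id, rebuilt for every ubatch.
std::map<llama_seq_id, std::vector<float>> embd_seq;

void store_pooled(const std::vector<llama_seq_id> & seq_ids,
                  const std::vector<std::vector<float>> & pooled) {
    embd_seq.clear(); // clearing per ubatch drops earlier sequences' results
    for (size_t i = 0; i < seq_ids.size(); ++i) {
        embd_seq[seq_ids[i]] = pooled[i];
    }
}
// If the batch is split into several ubatches, only the sequences present
// in the last ubatch still have embeddings afterwards; clearing once per
// decode call (and only inserting here) would preserve all of them.
```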