compilade
I've fixed the pooled embeddings problem with Mamba by making it only process a single sequence per `ubatch`. When the sequences are short, this is slightly slower than processing...
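A minimal sketch of the single-sequence-per-`ubatch` idea (hypothetical names, not the actual llama.cpp structures): split a mixed batch by sequence id so a recurrent model only ever tracks one state at a time, which makes mean-pooled embeddings straightforward.

```python
# Hypothetical sketch: split a mixed batch into per-sequence micro-batches
# so a recurrent model (e.g. Mamba) only tracks one state per ubatch.

def split_into_ubatches(tokens, seq_ids):
    """Group tokens by sequence id, preserving order within each sequence."""
    ubatches = {}
    for tok, sid in zip(tokens, seq_ids):
        ubatches.setdefault(sid, []).append(tok)
    return [ubatches[sid] for sid in sorted(ubatches)]

def mean_pool(embeddings):
    """Mean-pool a list of per-token embedding vectors into one vector."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]
```

Processing each resulting ubatch separately trades some throughput on short sequences (as noted above) for correctness of the pooled state.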
> Not sure if I'm understanding the comment correctly @jukofyork, but the logic I'm using to identify the most influential tensors/layers is to simply average the importance scores (IS) for...
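The averaging logic described above can be sketched like this (hypothetical tensor names and data layout; the real scores would come from whatever importance-measuring pass produced them):

```python
# Hypothetical sketch: average the per-sample importance scores (IS)
# recorded for each tensor, then sort tensors by that mean to find
# the most influential ones.

def rank_tensors_by_importance(scores):
    """scores: dict mapping tensor name -> list of importance scores."""
    means = {name: sum(vals) / len(vals) for name, vals in scores.items()}
    return sorted(means, key=means.get, reverse=True)
```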
I'd like it very much if they released a smaller version of their model. I don't have enough RAM to run Mixtral (only have 8GB), and Jamba seems to be...
> Any update on Jamba support? I've worked on refactoring the KV cache over the past few weeks to allow managing both recurrent states and Attention's KV cache at once. (See...
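A toy sketch of what "managing both at once" means for a hybrid model like Jamba (hypothetical classes, not llama.cpp's actual KV-cache code): Attention layers append `(K, V)` pairs per token, while recurrent layers overwrite a single fixed-size state in place.

```python
# Hypothetical sketch of a cache holding Attention KV entries and
# recurrent states side by side, as a hybrid Attention/Mamba model needs.

class HybridCache:
    def __init__(self, n_layers, recurrent_layers):
        self.recurrent_layers = set(recurrent_layers)
        # Attention layers grow a list of (k, v) pairs per token.
        self.kv = {i: [] for i in range(n_layers)
                   if i not in self.recurrent_layers}
        # Recurrent layers keep one state, overwritten each step.
        self.state = {i: None for i in self.recurrent_layers}

    def update(self, layer, k=None, v=None, new_state=None):
        if layer in self.recurrent_layers:
            self.state[layer] = new_state   # constant memory per layer
        else:
            self.kv[layer].append((k, v))   # memory grows with context
```

The key design point is that the two kinds of storage have different lifetimes and growth patterns, which is why unifying them in one cache abstraction takes real refactoring work.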
> For your endeavors, could I 'Buy You a Coffee' to help support? @severian42 I appreciate the offer (it means a lot!), but I can't accept for now. Receiving international...
Okay, turns out I only had to put like, 2 to 3 more days of work on this and BAM **it works**. As of today, in [branch `refactor-kv-cache`](), using the...
There is still more work I need to put into this. I've got inference working, but things that are not yet done are:

- state saving and reloading to and...
> how can they work if the issue is not complete? @ELigoP Well, technically the layout of the GGUF files doesn't really need to be changed further for Jamba support,...
> They adopt a channel-wise scaling factor compared to the tensor-level ones. Maybe a separate kernel can be built to apply scales outside of the matmul kernels? Hmm,...
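The idea being discussed can be sketched in pure Python (hypothetical helper names, not a real kernel): with a per-channel (column-wise) scale on the weights, the scale factors out of the matmul, so an unscaled matmul kernel can run first and a separate scaling kernel can apply the per-channel factors to the output afterwards.

```python
# Hypothetical sketch: per-output-channel scales commute with the matmul,
# so matmul(x, W * s) == scale_columns(matmul(x, W), s).

def matmul(x, w):
    """x: m x k, w: k x n, plain nested-list matrix multiply."""
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

def scale_columns(y, scales):
    """Apply one scale factor per output channel (column)."""
    return [[v * scales[j] for j, v in enumerate(row)] for row in y]
```

This is why a standalone scaling pass is plausible: it keeps the matmul kernels oblivious to the quantization scheme's scale granularity.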