John


I think a prefetch/cache is certainly the way to go; there is a ton of room for improvement in the current implementation. Regarding my variant: it was not tailored to...
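As a rough illustration of the prefetch idea (not the actual llama.cpp loader, which works on mmap'd tensors in C++), a background reader that warms the OS page cache before the real load might look like this; `prefetch_file` and its parameters are hypothetical names for the sketch:

```
import threading

def prefetch_file(path, chunk_size=16 * 1024 * 1024):
    """Sequentially read the file in a background thread so the OS page
    cache is warm by the time the real loader touches the tensors."""
    def _reader():
        with open(path, "rb", buffering=0) as f:
            while f.read(chunk_size):
                pass  # bytes are discarded; the read itself fills the cache

    t = threading.Thread(target=_reader, daemon=True)
    t.start()
    return t

# usage: kick off the warm-up, then proceed with normal model loading
# prefetch_file("falcon-40b.ggml.bin")
```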

> First we need to implement ggml

Mind elaborating on that? It does not seem to make sense in context. From what I read (I've not tested it), the model...

I've just invested almost an hour of prompting into Instruct Falcon 40B and it's significantly smarter than OpenAssistant 30B, despite being less well tuned. It is smarter than Turbo when...

> there's a guy who provided a q4b version of Falcon7B, would it be of some use for llama.cpp?
>
> https://github.com/Birch-san/falcon-play

Falcon has the full precision binaries available...

I took a look: Falcon is Bloom-based, and uses GPT-NeoX rotary embeddings and GELU activation (https://huggingface.co/tiiuae/falcon-40b/commit/e7950c40d6bc9caca678af160de9c79f33f93699). It looks like most of it is covered in https://github.com/NouamaneTazi/bloomz.cpp already. Though it looks like...
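To make the "GPT-NeoX rotary embeddings" point concrete: NeoX-style RoPE rotates the first and second halves of each head dimension against each other (rather than interleaved pairs, as in GPT-J). A minimal PyTorch sketch, purely illustrative and not Falcon's actual modelling code:

```
import torch

def neox_rope(x, base=10000.0):
    # x: (seq_len, n_head, head_dim); rotates half-dims, NeoX style
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)
    cos = angles.cos()[:, None, :]  # broadcast over heads
    sin = angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```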

Not without adaptation; I've not looked into the differences (aside from the parameter and layer counts), but there certainly are some. Also, bloomz is barebones: no GPU support, etc. It...

They updated the main page, but not the model pages yet. They are just a bit slow to follow up, but it looks like we're getting a fully open-source model...

I also struggled and didn't get it to run yet. There are significant differences in the attention/KQV handling between 7B and 40B. Without multi_query (40B):

```
# fused QKV projection; trailing arguments completed from the upstream
# tiiuae/falcon-40b modelling_RW.py (the comment was truncated here)
self.query_key_value = Linear(
    self.hidden_size,
    (config.n_head_kv * 2 + config.n_head) * self.head_dim,
    bias=config.bias,
)
```
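The truncated comment presumably went on to show the 7B path. For contrast, this is my reading of the multi_query branch in the upstream modelling_RW.py (a sketch; verify against the hub revision you actually load):

```
# 7B path (config.multi_query == True): all query heads share a single
# K head and a single V head, so the fused projection is much narrower.
self.query_key_value = Linear(
    self.hidden_size,
    self.hidden_size + 2 * self.head_dim,  # n_head Q heads + 1 K + 1 V
    bias=config.bias,
)
```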

@JohannesGaessler Did you see that project? https://github.com/turboderp/exllama/tree/master/exllama_ext/cuda_func Looks like a ton of kernels under the MIT license, including full matmul for half and quantized variants, RoPE, norm, etc. Not sure...

I lack experience with that particular model, but I do notice that you are attempting a complex instruction-following translation task with a 7B model. So even if it is very well...