John


I think a prefetch/cache is certainly the way to go; there is a ton of room for improvement in the current implementation. Regarding my variant: it was not tailored to...
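As a rough illustration of the prefetch idea (not the actual llama.cpp loader, which works on mmap'd tensors in C++), a background reader that warms the OS page cache before the real load might look like this; `prefetch_file` and its parameters are hypothetical names for the sketch:

```
import threading

def prefetch_file(path, chunk_size=16 * 1024 * 1024):
    """Sequentially read the file in a background thread so the OS page
    cache is warm by the time the real loader touches the tensors."""
    def _reader():
        with open(path, "rb", buffering=0) as f:
            while f.read(chunk_size):
                pass  # bytes are discarded; the read itself fills the cache

    t = threading.Thread(target=_reader, daemon=True)
    t.start()
    return t

# usage: kick off the warm-up, then proceed with normal model loading
# prefetch_file("falcon-40b.ggml.bin")
```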

> First we need to implement ggml

Mind elaborating on that? It does not seem to make sense in context. From what I read (I've not tested it), the model...

I've just invested almost an hour of prompting into Instruct Falcon 40B and it's significantly smarter than OpenAssistant 30B, despite being less well tuned. It is smarter than Turbo when...

> there's a guy who provided a q4b version of Falcon7B, would it be of some use for llama.cpp?
>
> https://github.com/Birch-san/falcon-play

Falcon has the full precision binaries available...

I took a look: Falcon is Bloom-based, and uses GPT-NeoX rotary embeddings and GELU activation (https://huggingface.co/tiiuae/falcon-40b/commit/e7950c40d6bc9caca678af160de9c79f33f93699). It looks like most of it is covered in https://github.com/NouamaneTazi/bloomz.cpp already. Though it looks like...
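To make the "GPT-NeoX rotary embeddings" point concrete: NeoX-style RoPE rotates the first and second halves of each head dimension against each other (rather than interleaved pairs, as in GPT-J). A minimal PyTorch sketch, purely illustrative and not Falcon's actual modelling code:

```
import torch

def neox_rope(x, base=10000.0):
    # x: (seq_len, n_head, head_dim); rotates half-dims, NeoX style
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)
    cos = angles.cos()[:, None, :]  # broadcast over heads
    sin = angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```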

Not without adaptation; I've not looked into the differences (aside from the parameter and layer counts), but there certainly are some. Also, bloomz is barebones: no GPU support, etc. It...

They updated the main page, but not the model pages yet. They are just a bit slow to follow up, but it looks like we're getting a fully open-source model...

I also struggled and didn't get it to run yet. There are significant differences in the attention/KQV handling between 7B and 40B. Without multi_query (40B):

```
# fused QKV projection; trailing arguments completed from the upstream
# tiiuae/falcon-40b modelling_RW.py (the comment was truncated here)
self.query_key_value = Linear(
    self.hidden_size,
    (config.n_head_kv * 2 + config.n_head) * self.head_dim,
    bias=config.bias,
)
```
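The truncated comment presumably went on to show the 7B path. For contrast, this is my reading of the multi_query branch in the upstream modelling_RW.py (a sketch; verify against the hub revision you actually load):

```
# 7B path (config.multi_query == True): all query heads share a single
# K head and a single V head, so the fused projection is much narrower.
self.query_key_value = Linear(
    self.hidden_size,
    self.hidden_size + 2 * self.head_dim,  # n_head Q heads + 1 K + 1 V
    bias=config.bias,
)
```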

@JohannesGaessler Did you see that project? https://github.com/turboderp/exllama/tree/master/exllama_ext/cuda_func Looks like a ton of kernels under the MIT license, including full matmul for half and quantized variants, RoPE, norm, etc. Not sure...

I lack experience with that particular model, but I do notice that you are attempting a complex instruction-following translation task with a 7B model. So even if it is very well...