Nikhil Gupta
> Hi @NicoJuicy. We are not currently supporting TF Lite, but this can definitely be an interesting feature to include in the future! In the coming days we will draw...
I am facing the same issue with one of the blocks of the LLM model that I am trying to convert.
Hello, were you able to fix this issue?
> More on this: recent koboldcpp build, Snapdragon 8 Gen 1, Termux.
>
> Any quant is garbled at GGUF model. k quant or not. Offloaded layers or not. GGML...
Hello @alankelly @wei-v-wang, how can we fix this issue if we are sticking to Ubuntu 16.04 and GCC 5.4.0? I have tried `#define _POSIX_C_SOURCE 199309L` as suggested by...
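
In case it helps others hitting this, here is a minimal sketch of how that macro is normally applied, assuming the underlying failure is the usual `clock_gettime` declaration problem on older glibc (this is an illustration, not the project's actual fix):

```c
/* The feature-test macro must come before ANY system header is included
 * (or be passed on the command line as -D_POSIX_C_SOURCE=199309L);
 * otherwise glibc has already frozen its feature set and the #define
 * has no effect. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec ts;
    /* Declared only when _POSIX_C_SOURCE >= 199309L on old glibc;
     * glibc < 2.17 additionally needs linking with -lrt. */
    clock_gettime(CLOCK_REALTIME, &ts);
    printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}
```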
Hey @naveengovind, did you manage to get a fix yet? I am facing the exact same issue right now for my use case. @xenova, do you have any further inputs on...
The `x` activation is fed back through multiple layers after being modified with the help of the attention output, and that is why, I guess, attention is needed for the input tokens as well.
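
To make that feedback path concrete, here is a minimal, self-contained sketch of the residual wiring; `attention` and `ffn` are reduced to placeholder stubs, and all names are illustrative:

```c
#include <stddef.h>
#include <stdio.h>

/* Placeholder stubs; a real model computes multi-head attention and a
 * feed-forward block here. */
static void attention(float *out, const float *x, size_t dim) {
    for (size_t i = 0; i < dim; i++) out[i] = 0.1f * x[i];
}
static void ffn(float *out, const float *x, size_t dim) {
    for (size_t i = 0; i < dim; i++) out[i] = 0.1f * x[i];
}

/* Residual wiring: each block writes into a scratch buffer and its
 * output is added back into x, so the activation reaching layer l+1
 * already carries every earlier attention/FFN contribution. */
static void forward_layers(float *x, float *tmp, size_t dim, int n_layers) {
    for (int l = 0; l < n_layers; l++) {
        attention(tmp, x, dim);
        for (size_t i = 0; i < dim; i++) x[i] += tmp[i];
        ffn(tmp, x, dim);
        for (size_t i = 0; i < dim; i++) x[i] += tmp[i];
    }
}

int main(void) {
    float x[4] = {1, 1, 1, 1}, tmp[4];
    forward_layers(x, tmp, 4, 2);
    printf("%f\n", x[0]);  /* 1.1^4 = 1.4641: two layers, two residual adds each */
    return 0;
}
```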
We can definitely avoid

```c
rmsnorm(x, x, w->rms_final_weight, dim);
// classifier into logits
matmul(s->logits, x, w->wcls, p->dim, p->vocab_size);
```

for the input tokens. It will give some perf bump.
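
A sketch of what that skip could look like, assuming llama2.c-style `Config`/`RunState`/`TransformerWeights` structs and `rmsnorm`/`matmul` helpers (stubbed out below so the snippet stands alone); the `compute_logits` argument is hypothetical:

```c
/* Minimal stand-ins for llama2.c's types and helpers so the sketch is
 * self-contained; the real definitions live in run.c. */
typedef struct { int dim; int vocab_size; } Config;
typedef struct { float *x; float *logits; } RunState;
typedef struct { float *rms_final_weight; float *wcls; } TransformerWeights;

static void rmsnorm(float *o, float *x, float *weight, int size) {
    (void)o; (void)x; (void)weight; (void)size;  /* stub */
}
static void matmul(float *xout, float *x, float *w, int n, int d) {
    (void)xout; (void)x; (void)w; (void)n; (void)d;  /* stub */
}
static void run_layers(int token, int pos, Config *p, RunState *s,
                       TransformerWeights *w) {
    (void)token; (void)pos; (void)p; (void)s; (void)w;  /* stub: attention + FFN stack */
}

/* 'compute_logits' is a hypothetical extra argument. The layers must
 * always run (they fill the KV cache), but the final rmsnorm and the
 * dim x vocab_size classifier matmul only matter for the token we
 * actually sample from. */
static void transformer(int token, int pos, int compute_logits,
                        Config *p, RunState *s, TransformerWeights *w) {
    run_layers(token, pos, p, s, w);
    if (!compute_logits) return;   /* prompt token: logits unused */
    rmsnorm(s->x, s->x, w->rms_final_weight, p->dim);
    matmul(s->logits, s->x, w->wcls, p->dim, p->vocab_size);
}

int main(void) {
    Config p = { 4, 8 };
    float x[4] = {0}, logits[8] = {0}, rms_w[4] = {0}, wcls[32] = {0};
    RunState s = { x, logits };
    TransformerWeights w = { rms_w, wcls };
    transformer(1, 0, /*compute_logits=*/0, &p, &s, &w);  /* prompt token */
    transformer(2, 1, /*compute_logits=*/1, &p, &s, &w);  /* sampled-from token */
    return 0;
}
```

The caller would pass `compute_logits = (pos >= num_prompt_tokens - 1)`: the classifier matmul is the largest single matmul in the model, so skipping it for all prompt tokens but the last is where the perf bump comes from.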
What is the perf that you are getting on the TVM CPU and TVM GPU backends? If your Arm Compute Library implementation is ready, can you please share its perf as...
Hello, does the matmul implementation support all the quantizations (Q8_0, Q4_0) on QNN? Did we check the accuracy of the matmul?