Sasha Rush
yeah, I agree with all this. The names and coding conventions came directly from https://github.com/karpathy/llama2.c I just copied them over. I'll do another pass to make the names better and...
Oh weird, for some reason they added 2 additional word tokens (2 * 5120 * 2 * 4 bytes). I'll take them out for now, and think about a way...
Thanks! This is all really helpful. The string stuff tripped me up a bit. If you have a minute, can you explain 6 to me? How do I check that it...
Amazing, that's really helpful to know. Thanks for pointing it out. Do you plan on continuing to work on this? Was planning on moving on, but now I'm kind of...
Nice. This bumped me up from 0.92 t/s to 1.02 t/s on Llama 2 7B.
Nice, I will try to catch up on your code. Some of the HF people recommended trying to do GPTQ inference (quant-full mat-vec). Which version are you doing?
hi! I saw that you are also a maintainer of Triton and worked on the AoT compiler. I'm playing around with trying to set this project up to use Triton...
Thanks, once I got it running it was fast, but then when I tried to further optimize the Triton code, the rust version went out of sync with the python...
It's using Rayon for data parallel matrix vector mult, but no other libraries. See the rust library `Candle` which has a full implementation with matrix mults. Was thinking I would...
1. Should work fine with ARM, but currently it is f32 only. (Note though, this is CPU-only, no GPU support.) Have to think about how to add f16. 2....