
Good ideas from llama.cpp

Open setzer22 opened this issue 2 years ago • 8 comments

I've been tracking the llama.cpp repo. I'll use this issue to list any good ideas / things we should be aware of to keep up with in Rust land:

  • [ ] GPTQ quantization :eyes: https://github.com/ggerganov/llama.cpp/issues/9
  • [ ] Not sure how that's even possible (isn't the task I/O-bound?), but people are claiming great speedups when loading the model in parallel. This should be pretty easy to implement using rayon (see the sketch below this list). https://github.com/ggerganov/llama.cpp/issues/85#issuecomment-1470814328
  • [ ] Seems there's an issue with the normalization function used. It should be RMSNorm. Would be good to keep an eye on this, and simply swap in the ggml function once it's implemented on the C++ side :eyes: https://github.com/ggerganov/llama.cpp/issues/173#issuecomment-1470801468
  • [ ] It looks like dropping to F16 for the memory_k and memory_v reduces memory usage. It's not known whether this hurts quality, but we should follow the C++ side and add a flag to drop to F16 for the memory (a rough size estimate is sketched below this list). This would also make the cached prompts added as part of #14 take half the size on disk, which is a nice bonus: https://github.com/ggerganov/llama.cpp/pull/154#pullrequestreview-1342214004
  • [x] Looks like the fix from #1 just landed upstream. We should make sure to fix it here too https://github.com/ggerganov/llama.cpp/pull/161
  • [ ] The tokenizer used in llama.cpp has some issues. It would be better to use sentencepiece, which is the one that was used during the original LLaMA training. There seems to be a rust crate for sentencepiece. We should check if a drop-in replacement is possible https://github.com/ggerganov/llama.cpp/issues/167
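
Not from the upstream discussion itself, but to make the parallel-loading idea above concrete: a minimal sketch using rayon to read several model part files concurrently. The multi-file layout, the helper name, and the raw-bytes return type are assumptions for illustration; the real loader parses ggml tensors rather than returning raw buffers.

```rust
use rayon::prelude::*;
use std::{fs, io, path::PathBuf};

/// Read a set of model part files in parallel. Hypothetical helper:
/// the paths and the raw-bytes return type are placeholders.
fn read_parts(paths: &[PathBuf]) -> io::Result<Vec<Vec<u8>>> {
    paths
        .par_iter()               // rayon parallel iterator over the paths
        .map(|p| fs::read(p))     // each file is read on a worker thread
        .collect()                // collecting into Result propagates the first error
}
```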
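And for the memory_k / memory_v item, a back-of-the-envelope sketch of what F16 saves. The shapes assume LLaMA-7B (32 layers, 4096 embedding dim) at a 2048-token context; actual numbers depend on the configured context length.

```rust
fn main() {
    // KV cache element count = 2 (K and V) * n_layer * n_ctx * n_embd
    let (n_layer, n_ctx, n_embd) = (32u64, 2048u64, 4096u64); // LLaMA-7B shapes assumed
    let elements = 2 * n_layer * n_ctx * n_embd;
    println!("f32 cache: {} MiB", elements * 4 >> 20); // ~2048 MiB
    println!("f16 cache: {} MiB", elements * 2 >> 20); // ~1024 MiB
}
```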

setzer22 avatar Mar 15 '23 21:03 setzer22

Suggest pinning this issue :>

philpax avatar Mar 16 '23 01:03 philpax

For the tokenizer item I suggest using https://github.com/huggingface/tokenizers/

Should work out of the box once converted. When this PR lands (https://github.com/huggingface/transformers/pull/21955), it should become a simple `let tokenizer = Tokenizer::from_file("filename")`. Cheers!
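
For reference, using a converted tokenizer via the tokenizers crate would look roughly like this (the file name and input text are placeholders):

```rust
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    // "tokenizer.json" is whatever the conversion step produces.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;
    let encoding = tokenizer.encode("Hello, LLaMA!", false)?; // false = don't add special tokens
    println!("{:?}", encoding.get_ids());
    Ok(())
}
```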

Narsil avatar Mar 16 '23 07:03 Narsil

RMS norm landed, but they've reported regressions. Need to keep an eye on that.
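
For context, RMSNorm drops the mean-centering and bias of standard LayerNorm and only rescales by the root mean square. A minimal scalar sketch of the formula (not the ggml implementation):

```rust
/// RMSNorm over one vector: y_i = x_i / sqrt(mean(x^2) + eps) * weight_i
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    x.iter()
        .zip(weight)
        .map(|(v, w)| v * scale * w)
        .collect()
}
```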

philpax avatar Mar 16 '23 11:03 philpax

@Narsil The LLaMA tokenizer needs the byte-fallback option. 🥹

> For the tokenizer item I suggest using https://github.com/huggingface/tokenizers/
>
> Should work out of the box once converted. When this PR lands (https://github.com/huggingface/transformers/pull/21955), it should become a simple `let tokenizer = Tokenizer::from_file("filename")`. Cheers!

dongs0104 avatar Mar 16 '23 12:03 dongs0104

Good news, everyone!

https://github.com/huggingface/tokenizers/pull/1183

(If this goes in, I'll try to make a release soon after.)

Narsil avatar Mar 17 '23 14:03 Narsil

Awesome! Looking forward to it :D

philpax avatar Mar 17 '23 15:03 philpax

A small comment on the parallel loading: it is definitely possible to improve IO reads by parallelizing. This is much more effective on SSDs, but it still works on HDDs thanks to caching at different layers. However, this should be configurable, since performance can start to degrade past a certain degree of parallelism, depending on the storage medium and also things like the kernel and buffer sizes.
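
If the Rust side does parallelize reads, rayon makes the degree of parallelism straightforward to expose as a setting. A hedged sketch, with `n_threads` assumed to come from a CLI flag:

```rust
use rayon::ThreadPoolBuilder;

/// Run the loading closure on a pool capped at `n_threads`, so users on
/// HDDs (or with small read-ahead buffers) can dial parallel reads down to 1.
fn with_load_pool<T: Send>(n_threads: usize, load: impl FnOnce() -> T + Send) -> T {
    let pool = ThreadPoolBuilder::new()
        .num_threads(n_threads)
        .build()
        .expect("failed to build thread pool");
    pool.install(load)
}
```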

dnlmlr avatar Mar 18 '23 15:03 dnlmlr

@dnlmlr Do you have a benchmark to back that up? I didn't find that to be the case whenever I tried.

Memory-mapping was always consistently better than reading the file (provided you need the whole file), and it doesn't require parallelism (at the user level, that is; no idea how the kernel handles it).
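
For comparison, memory-mapping the whole model file in Rust is only a few lines with the memmap2 crate (a sketch; the loader would still need to parse the ggml layout out of the mapped bytes):

```rust
use memmap2::Mmap;
use std::fs::File;

fn map_model(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // Safety: the mapping stays valid only as long as nothing truncates the file.
    let mmap = unsafe { Mmap::map(&file)? };
    Ok(mmap)
}
```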

Narsil avatar Mar 20 '23 16:03 Narsil

@setzer22 Are you okay with me closing this issue and splitting it into individual issues?

philpax avatar Mar 26 '23 00:03 philpax

Yup, sounds good :+1:

setzer22 avatar Mar 26 '23 13:03 setzer22

This issue has been superseded by #35, #62, #78, #79 and #80.

philpax avatar Mar 26 '23 14:03 philpax