Making results independent of thread count/batch size (from llama.cpp)
This may be something to keep an eye on: https://github.com/ggerganov/llama.cpp/pull/439
Looks like the corresponding code is here: https://github.com/rustformers/llama-rs/blob/bf7bdbcfff3114dcbdafb6eb7eed58f04f19b1c3/llama-rs/src/lib.rs#L1203
According to the comments in the pull request, the change should trade a small amount of performance for lower memory usage. However, at least one user commented that they saw *higher* memory use (the model size wasn't specified).
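
For context on why results can depend on thread count or batch size in the first place: f32 addition is not associative, so splitting a reduction (e.g. a dot product in the attention computation) into a different number of partial sums can change the rounded result. The sketch below is purely illustrative, not code from the PR or from llama-rs; it just shows a sequential sum disagreeing with a chunked sum of the kind a multi-threaded reduction would produce.

```rust
/// Illustrative only: f32 addition is non-associative, so different
/// reduction orders (one partial sum per "thread" vs. a single
/// sequential pass) can yield slightly different totals.
fn main() {
    let values: Vec<f32> = (0..1_000_000)
        .map(|i| (i as f32).sin() * 1e-3)
        .collect();

    // Sequential reduction: one fixed accumulation order.
    let sequential: f32 = values.iter().sum();

    // Chunked reduction, as a parallel dot product might do it:
    // each hypothetical thread sums its own slice, then the
    // partial sums are combined.
    let chunked: f32 = values
        .chunks(values.len() / 8)
        .map(|chunk| chunk.iter().sum::<f32>())
        .sum();

    println!("sequential: {sequential}");
    println!("chunked:    {chunked}");
    println!("difference: {}", (sequential - chunked).abs());
}
```

Making inference deterministic then amounts to fixing the accumulation order regardless of how many threads or how large a batch is used, which is (as I understand it) what the llama.cpp PR is after.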