Copy v_transposed like llama.cpp
See https://github.com/ggerganov/llama.cpp/pull/439
Closes #67
I'm not necessarily proposing to merge this, just putting it here in case it's useful.
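For context, as I understand the llama.cpp pull, the change amounts to copying the permuted ("transposed") V data into its own contiguous tensor before the attention multiplication, instead of handing the later operations a strided view. The snippet below is a standalone, plain-Rust illustration of that view-vs-copy idea only; it is not the actual ggml graph code, and the names in it are made up for the example.

```rust
// Conceptual sketch only: the real change happens in ggml's graph construction.
// This just shows the memory-layout idea: instead of reading V "transposed"
// through strided index arithmetic (a view), materialize V^T once as a
// contiguous buffer and read that.
fn main() {
    let (rows, cols) = (2usize, 3usize);
    // Row-major 2x3 source matrix V.
    let v = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];

    // "Transposed view": element (r, c) of V^T is fetched via strides, no copy.
    let v_t_view = |r: usize, c: usize| v[c * cols + r];

    // Explicit copy: V^T is 3x2, stored row-major and contiguous.
    let mut v_t_copy = vec![0.0f32; rows * cols];
    for i in 0..cols {
        for j in 0..rows {
            v_t_copy[i * rows + j] = v_t_view(i, j);
        }
    }

    println!("{:?}", v_t_copy); // [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
}
```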
From my very, very, very unscientific testing, it seems like this does very slightly increase memory usage and also increases token generation time a little bit.
These notes are super unscientific, but included just in case they're useful. Also note that these tests were on a machine running other applications like VS Code, web browsers, etc. The tests with the 30B model are close to my machine's memory limit (32GB), so they may have caused some swapping as well.
The differences are definitely within the margin of error just because the tests weren't very controlled. (It also seemed like it made more of a difference with 12 threads vs 6; my CPU only has 6 physical cores.)
==== 7B ====
new 7B 12t
Maximum resident set size (kbytes): 4261744
feed_prompt_duration: 5502ms
prompt_tokens: 18
predict_duration: 14414ms
predict_tokens: 50
per_token_duration: 288.280ms
new 7B 6t
Maximum resident set size (kbytes): 4361328
feed_prompt_duration: 3942ms
prompt_tokens: 18
predict_duration: 47036ms
predict_tokens: 146
per_token_duration: 322.164ms
old 7B 12t
Maximum resident set size (kbytes): 4253076
feed_prompt_duration: 4119ms
prompt_tokens: 18
predict_duration: 12705ms
predict_tokens: 50
per_token_duration: 254.100ms
old 7B 6t
Maximum resident set size (kbytes): 4290144
feed_prompt_duration: 4001ms
prompt_tokens: 18
predict_duration: 39464ms
predict_tokens: 146
per_token_duration: 270.301ms
==== 13B ====
new 13B 12t
Maximum resident set size (kbytes): 8326708
feed_prompt_duration: 8033ms
prompt_tokens: 18
predict_duration: 83420ms
predict_tokens: 146
per_token_duration: 571.370ms
new 13B 6t
Maximum resident set size (kbytes): 8173012
feed_prompt_duration: 7985ms
prompt_tokens: 18
predict_duration: 42496ms
predict_tokens: 82
per_token_duration: 518.244ms
new 13B 6t (second run)
feed_prompt_duration: 8160ms
prompt_tokens: 18
predict_duration: 41615ms
predict_tokens: 82
per_token_duration: 507.500ms
old 13B 12t
Maximum resident set size (kbytes): 8210536
feed_prompt_duration: 7813ms
prompt_tokens: 18
predict_duration: 71144ms
predict_tokens: 146
per_token_duration: 487.288ms
old 13B 6t
feed_prompt_duration: 9226ms
prompt_tokens: 18
predict_duration: 39793ms
predict_tokens: 82
per_token_duration: 485.280ms
==== 30B ====
new 30B 6t
Maximum resident set size (kbytes): 20291036
feed_prompt_duration: 18688ms
prompt_tokens: 18
predict_duration: 97053ms
predict_tokens: 82
per_token_duration: 1183.573ms
old 30B 6t
Maximum resident set size (kbytes): 20257344
feed_prompt_duration: 19693ms
prompt_tokens: 18
predict_duration: 93953ms
predict_tokens: 82
per_token_duration: 1145.768ms
I'm a bit confused about this change. Does it increase quality? Because from what you're reporting, it seems to increase memory use and increase inference time.
Probably needs some more scientific testing :smile: At least comparing the mean and stdev of 5 executions on `main` and this branch, for inference time and memory usage, to see if the differences are statistically significant. Inference time should be easy to record now that we print statistics at the end of generation, but we have nothing to accurately measure memory usage :thinking:
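For the timing side, a minimal sketch of that kind of comparison could look like this. It's standalone Rust with placeholder numbers, not code from the repository; the real per-token times would come from the statistics printed at the end of each run.

```rust
// Compare mean and sample standard deviation of per-token times across a
// handful of runs on each branch. The numbers below are placeholders.
fn mean_and_stddev(samples: &[f64]) -> (f64, f64) {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, var.sqrt())
}

fn main() {
    // Placeholder per-token times in ms; substitute real measurements.
    let on_main = [254.1, 270.3, 260.8, 257.2, 265.0];
    let on_branch = [288.3, 322.2, 301.5, 295.4, 310.7];

    for (name, runs) in [("main", &on_main[..]), ("this branch", &on_branch[..])] {
        let (mean, sd) = mean_and_stddev(runs);
        println!("{name}: mean {mean:.1} ms/token, stddev {sd:.1} ms");
    }
}
```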
@setzer22 See the comments and referenced issues in the llama.cpp pull: https://github.com/ggerganov/llama.cpp/pull/439
I basically just made the same change, I can't say I understand it or anything.
According to the pull and the issue it references, llama.cpp previously produced non-deterministic results depending on the thread count and batch size. That pull was intended to fix the issue so that llama.cpp could produce deterministic results. I assume it would be the same for llama-rs.
Right now llama-rs doesn't have a way to specify the RNG seed so it's a less visible problem, but that's probably something that would be good to add for reproducible testing.
Actually, I'm blind, llama-rs does have `--seed`. You can test it out yourself: set the seed to something, generate some tokens, then restart with a different thread count or batch size and you will get completely different results on the main branch.
With this change, the results are exactly identical for the same seed regardless of batch size or thread count. I tested 4-5 times each with various batch sizes and thread counts.
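One plausible way thread count and batch size can leak into the output at all: floating-point addition is not associative, so splitting the same accumulations across a different number of chunks can round differently, and even a tiny difference in the logits can change which token gets sampled, after which the whole completion diverges. The snippet below is a standalone illustration of that effect, not llama-rs code.

```rust
// Standalone illustration: summing the same values in a different grouping
// (as a different thread count or batch size might do) can give a slightly
// different floating-point result.
fn main() {
    let values: Vec<f32> = (1..=100_000).map(|i| 1.0 / i as f32).collect();

    // One big accumulation, front to back.
    let serial: f32 = values.iter().sum();

    // Same values, accumulated in 4 chunks and then combined.
    let chunked: f32 = values
        .chunks(values.len() / 4)
        .map(|c| c.iter().sum::<f32>())
        .sum();

    println!("serial  = {serial}");
    println!("chunked = {chunked}");
    println!("equal?    {}", serial == chunked); // almost certainly false
}
```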
One thing we could potentially do is make this a CLI option; it would mean duplicating that one line of code, which probably isn't a big deal. Maybe something like `--reproducible`? We could also (or alternatively) turn on this behavior if `--seed` is specified.
Hm, something like `improve_determinism: bool` in `InferenceParameters`? `deterministic` might promise more than we can actually do, but I'm fine with the change in principle (as long as we document what it does well).
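To make that concrete, here is a hypothetical sketch of what such a field might look like. The struct shape, the other fields, and the default values are illustrative only, not the actual llama-rs definitions, and the field name was still being discussed at this point.

```rust
// Hypothetical sketch of the proposal, not the real llama-rs code: a
// documented boolean on the inference parameters, with a Default impl so
// existing callers keep compiling.
pub struct InferenceParameters {
    pub n_threads: usize,
    pub n_batch: usize,
    /// When enabled, V is copied into a contiguous tensor during evaluation
    /// (as in the referenced llama.cpp change), making generation for a fixed
    /// seed independent of thread count and batch size, at a small (possibly
    /// unmeasurable) cost in time and memory.
    pub increased_determinism: bool,
}

impl Default for InferenceParameters {
    fn default() -> Self {
        Self {
            n_threads: 4,
            n_batch: 8,
            increased_determinism: true,
        }
    }
}

fn main() {
    // Callers that rely on Default are unaffected by the new field.
    let params = InferenceParameters::default();
    println!(
        "threads: {}, batch: {}, increased determinism: {}",
        params.n_threads, params.n_batch, params.increased_determinism
    );
}
```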
Sure, that's certainly no problem. Or `increase_determinism`. (As far as I could tell, it was deterministic with a seed specified from the testing I did. There might be edge cases or specific hardware where it works differently.)
Would you want it as a command-line flag also? And should it be automatically enabled if `--seed` is specified? (Right now, if you change the threads or batch size even with `--seed`, you'll probably get different results.)
Yup, I'll leave the exact name to you. I think it's reasonable to turn it on with the seed setting; I can't think of many circumstances in which you'd want to set a seed without the additional determinism, and I don't think determinism matters if you don't have a seed set.
> As far as I could tell, it was deterministic with a seed specified from the testing I did
Yup, we're just being a bit careful with promising determinism overall, since we haven't done any proper testing. When fixing a seed, things should be reproducible on the same hardware, but we haven't done any testing to ensure this is reproducible across machines (other than some very anecdotal testing on my desktop + laptop, for which I can report results are indeed identical).
There were some caveats reported in #27 about floating point math determinism in general which may or may not affect us (since we're on CPU, and CPU manufacturers take floating point determinism more seriously).
> Would you want it as a command-line flag also? Also, should it be automatically enabled if `--seed` is specified
I would say, yes. It's best to have this under a command-line flag (disabled by default) if it makes inference speed measurably lower. And I think automatically enabling when --seed is specified is also OK, sounds like good ergonomics :+1:
I've been doing some tests, but it's hard to measure whether inference speed has gotten slower with the code change, because different prompts can make inference speed vary by up to 30ms/token on my machine, both on `main` and on this branch.
It's hard to draw conclusions about the timings here because I'm getting different completions for the same seed, and timings are affected by differences in the generated text.
What I can report positively on is that I'm getting different results with the same seed and different numbers of threads on `main`, and this PR fixes it :)
Would it be too confusing to call it `--toggle-increased-determinism` or something like that? This way, it could get automatically enabled with `--seed` but disabled if necessary, or enabled otherwise as well. Or maybe make it an argument that takes a value like `--increased-determinism`, where it could be `auto`, `on`, or `off`.
It's probably silly to worry about it; I just hate making it so there's no way to do something that someone might want to do. (Probably my biggest pet peeve with software is when it just says "No, I can't let you do that, you might hurt yourself!")
edit: And yeah, I honestly couldn't tell if it made a noticeable difference. It seemed pretty small if it existed at all.
How about this approach? It adds an `--increased-determinism` option that can be set to `auto` (the default), `on`, or `off`, and the option includes an explanation of why it's necessary. A rough sketch of what that tri-state setting looks like is below.
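The following is a hypothetical, std-only sketch of such an auto/on/off setting; the actual CLI wiring and names in the PR may differ, and the "auto defers to whether a seed was given" rule simply mirrors the behaviour discussed above.

```rust
// Hypothetical sketch, not the real llama-rs CLI code.
use std::str::FromStr;

#[derive(Clone, Copy, Debug, Default)]
enum IncreasedDeterminism {
    #[default]
    Auto,
    On,
    Off,
}

impl FromStr for IncreasedDeterminism {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "auto" => Ok(Self::Auto),
            "on" => Ok(Self::On),
            "off" => Ok(Self::Off),
            other => Err(format!("expected auto|on|off, got {other}")),
        }
    }
}

impl IncreasedDeterminism {
    /// Resolve the tri-state to a concrete bool: `Auto` enables the copy
    /// whenever an explicit seed was provided.
    fn resolve(self, seed_was_set: bool) -> bool {
        match self {
            Self::Auto => seed_was_set,
            Self::On => true,
            Self::Off => false,
        }
    }
}

fn main() {
    let flag: IncreasedDeterminism = "auto".parse().unwrap();
    assert!(flag.resolve(true));   // seed set -> copy enabled
    assert!(!flag.resolve(false)); // no seed  -> copy disabled
}
```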
I made `Model::evaluate` take `&InferenceParameters` rather than just adding more arguments to it. I also put the actual `v_transposed` calculation in a closure outside of the loops to avoid code duplication in the conditional.
I did some profiling and the difference between the charts seems very small, so the observed differences probably aren't anything more than noise. (It looks like copying the data may account for 1-2% of the total time.)
I guess it's not really surprising, but this also shows that the code that executes in llama-rs itself is basically insignificant in performance terms. (Obviously the way it sets up the GGML stuff is very important though.)
[Attached flamegraph SVGs: Old FC, New FC, Old FG, New FG]
Interesting. If the cost is that unnoticeable, I'm inclined to merge it as-is. Thoughts, @setzer22 ?
I should add that this is the first time I've tried profiling Rust code or reading flamegraphs, so my opinion isn't too authoritative. It doesn't seem like there's any significant visual difference, though.
By the way, you might have already noticed but if you click/open those SVGs, they're interactive and you can focus elements or search based on name within it.
I couldn't notice any performance differences in my tests either, so I'd say we can merge as is. No need to put it behind a flag.
> I guess it's not really surprising, but this also shows that the code that executes in llama-rs itself is basically insignificant in performance terms. (Obviously the way it sets up the GGML stuff is very important though.)
Yup. I don't think there's anything we can do on the rust side (other than, as you say, change the tensor operations) that would have a significant impact on performance. All the heavy lifting is done by ggml.
Thanks for updating the branch! Looks like setzer and I agree that the flag's not necessary given the unnoticeable performance hit - can you remove it and I'll merge the PR?
@setzer22 How about this? I left it as a parameter in the library (defaulting to enabled) but removed the CLI stuff.
The rationale is it doesn't really add much complexity and there's at least a chance we'll want to revert it or maybe someone will have a use case where they care.
@philpax Were you talking about the CLI option, or just making it 100% not configurable at all? I can do that if you really want, but personally I'd give it a while being enabled by default and see if it turns out there's some reason for people to want to disable it.
Right now it shouldn't really even break library stuff, since the struct has a `Default` instance, so someone would have to explicitly be setting that field to something for it to matter if we removed it in the future. (And I changed the main CLI to use that rather than setting the parameter explicitly, so people who used that code as a reference won't run into issues either.)
Yeah, I'm fine with your rationale. Seems reasonable enough.
Same here :+1: What I would do is always enable this by default, and just have a flag to disable it.