mistral.rs
Fix timings of completion, add timing of sampling back
Now that sampling runs entirely on the CPU, we should not merge the sampling timings into the completion timings.
Separating them will likely show an improvement on mistralrs-bench's tg test.
Note that llama-bench selects a random token (https://github.com/ggerganov/llama.cpp/blob/master/examples/llama-bench/llama-bench.cpp#L1161), so the tg test is currently not a truly fair comparison.
Basically, we need to revert https://github.com/EricLBuehler/mistral.rs/pull/151
I'm not sure whether it's better to redesign the code or to mutate some variable here, which is the point where we know the correct timings.
I think we should start the sampling timer there, since that excludes the sync point.
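
To make the proposal concrete, here is a minimal sketch of timing one decode step with completion and sampling measured separately. It is not the mistral.rs code: `StepTimings`, `timed_step`, and the two closures are hypothetical names, and the sketch assumes the forward closure only returns after the device sync, so the sampling timer starts past the sync point.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-step timing record (not a mistral.rs type).
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct StepTimings {
    /// Time spent in the forward pass, including the sync point.
    pub completion: Duration,
    /// Time spent sampling on the CPU, started after the sync point.
    pub sampling: Duration,
}

impl StepTimings {
    /// Accumulate the timings of another step into this one.
    pub fn add(&mut self, other: StepTimings) {
        self.completion += other.completion;
        self.sampling += other.sampling;
    }

    /// Tokens per second based on completion time only.
    pub fn completion_tokens_per_sec(&self, tokens: usize) -> f64 {
        tokens as f64 / self.completion.as_secs_f64()
    }
}

/// Run one decode step and time its two phases separately.
/// Assumption: `forward` returns only once the logits are available
/// on the host, so the sync cost is attributed to completion.
pub fn timed_step<L, T>(
    forward: impl FnOnce() -> L,
    sample: impl FnOnce(L) -> T,
) -> (T, StepTimings) {
    let start = Instant::now();
    let logits = forward();
    // The sampling timer starts here, after the sync point.
    let after_forward = Instant::now();
    let token = sample(logits);
    let end = Instant::now();
    let timings = StepTimings {
        completion: after_forward.duration_since(start),
        sampling: end.duration_since(after_forward),
    };
    (token, timings)
}
```

A caller would sum the per-step `StepTimings` with `add` and report the two totals independently, so a tg-style benchmark can quote completion throughput without the sampling cost folded in.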
Done in a previous PR which didn't close this.