
GPT Benchmarks

Open mallorbc opened this issue 3 years ago • 2 comments

GPT models without a KV cache have to recompute activations for the entire context at every generation step, so total compute grows quadratically (not linearly) with the number of tokens generated.

So, for your benchmarks, how many tokens were generated, and from how long a prompt? Does this implementation support KV caching?

mallorbc avatar Oct 13 '22 04:10 mallorbc

In all benchmarks I generated 200 tokens, starting with a prompt consisting of a single token.

My implementation does support KV caching - I used the term "memory":

https://github.com/ggerganov/ggml/blob/e2f39f4b5295de0661d3c0ac4dfb89d4357c86f0/examples/gpt-2/main.cpp#L259-L276

Here we store new values into the memory:

https://github.com/ggerganov/ggml/blob/e2f39f4b5295de0661d3c0ac4dfb89d4357c86f0/examples/gpt-2/main.cpp#L448-L455

And here we use the cached data:

https://github.com/ggerganov/ggml/blob/e2f39f4b5295de0661d3c0ac4dfb89d4357c86f0/examples/gpt-2/main.cpp#L466-L473

https://github.com/ggerganov/ggml/blob/e2f39f4b5295de0661d3c0ac4dfb89d4357c86f0/examples/gpt-2/main.cpp#L507-L514

Even with caching, the per-token processing time still increases as the context grows, since attention must span all cached tokens. The benchmark numbers are the average time per token across the 200 generated tokens.

ggerganov avatar Oct 13 '22 06:10 ggerganov

Thanks for the insight. Also, very impressive work.

mallorbc avatar Oct 18 '22 16:10 mallorbc