GPT Benchmarks
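GPT models without a KV cache have to recompute the keys and values for every previous token at each step, so the compute time per generated token grows quickly (quadratically in the attention layers) as the input gets longer.
So, for your benchmarks, how many tokens were generated, and what was the total sequence length? Does the implementation support KV caching?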
In all benchmarks I generated 200 tokens, starting with a prompt consisting of a single token.
My implementation does support KV caching - I used the term "memory":
https://github.com/ggerganov/ggml/blob/e2f39f4b5295de0661d3c0ac4dfb89d4357c86f0/examples/gpt-2/main.cpp#L259-L276
Here we store new values into the memory:
https://github.com/ggerganov/ggml/blob/e2f39f4b5295de0661d3c0ac4dfb89d4357c86f0/examples/gpt-2/main.cpp#L448-L455
And here we use the cached data:
https://github.com/ggerganov/ggml/blob/e2f39f4b5295de0661d3c0ac4dfb89d4357c86f0/examples/gpt-2/main.cpp#L466-L473
https://github.com/ggerganov/ggml/blob/e2f39f4b5295de0661d3c0ac4dfb89d4357c86f0/examples/gpt-2/main.cpp#L507-L514
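To make the store/use pattern above concrete, here is a minimal single-head sketch. It is not the ggml code from the links: `KVCache` and `attend_step` are hypothetical names, the vectors are assumed to be already projected, and a plain `std::vector` stands in for the preallocated "memory" tensors.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical single-head KV cache: one key and one value vector per past token.
struct KVCache {
    std::vector<std::vector<float>> keys;
    std::vector<std::vector<float>> values;
};

// Process one new token. q, k, v are the (already projected) query, key and
// value of that token. Only the new k/v are stored; older ones are reused.
std::vector<float> attend_step(KVCache& cache,
                               const std::vector<float>& q,
                               const std::vector<float>& k,
                               const std::vector<float>& v) {
    assert(q.size() == k.size());
    assert(cache.values.empty() || cache.values.front().size() == v.size());

    // Store the new key and value into the memory.
    cache.keys.push_back(k);
    cache.values.push_back(v);

    const std::size_t n = cache.keys.size();
    const float scale = 1.0f / std::sqrt(static_cast<float>(q.size()));

    // Use the cached data: scaled dot product of q with every cached key.
    std::vector<float> scores(n);
    for (std::size_t i = 0; i < n; ++i) {
        float dot = 0.0f;
        for (std::size_t j = 0; j < q.size(); ++j) {
            dot += q[j] * cache.keys[i][j];
        }
        scores[i] = dot * scale;
    }

    // Numerically stable softmax over the scores.
    const float max_score = *std::max_element(scores.begin(), scores.end());
    float sum = 0.0f;
    for (float& s : scores) {
        s = std::exp(s - max_score);
        sum += s;
    }

    // Weighted sum of the cached values.
    std::vector<float> out(v.size(), 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        const float w = scores[i] / sum;
        for (std::size_t j = 0; j < out.size(); ++j) {
            out[j] += w * cache.values[i][j];
        }
    }
    return out;
}
```

Each call does work proportional to the number of cached tokens, so the step gets slower as the cache fills, but nothing from earlier steps is recomputed.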
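Even with caching, the processing time per token still increases as more tokens are generated, because each new token has to attend over all of the cached keys and values. The benchmarks report the average time per token over the 200 generated tokens.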
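As a rough illustration of that growth, the sketch below counts query-key dot products per step for a single attention head, with and without a cache. This is a simplified cost model, not a measurement: the function names are hypothetical, and it ignores the projections, the MLP layers, and everything else that contributes to real timings.

```cpp
#include <cassert>
#include <cstdint>

// With a cache, the new token's query is compared against n cached keys.
std::uint64_t step_cost_cached(std::uint64_t n) { return n; }

// Without a cache, causal attention is recomputed for all n positions:
// 1 + 2 + ... + n dot products.
std::uint64_t step_cost_uncached(std::uint64_t n) { return n * (n + 1) / 2; }

// Average per-step cost when generating `generated` tokens from a prompt of
// `prompt` tokens (model assumption: step t processes prompt + t tokens).
double average_step_cost(std::uint64_t prompt, std::uint64_t generated, bool cached) {
    assert(generated > 0);
    std::uint64_t total = 0;
    for (std::uint64_t t = 0; t < generated; ++t) {
        const std::uint64_t n = prompt + t;
        total += cached ? step_cost_cached(n) : step_cost_uncached(n);
    }
    return static_cast<double>(total) / static_cast<double>(generated);
}
```

For the setup described here (1-token prompt, 200 generated tokens), `average_step_cost(1, 200, true)` gives 100.5 dot products per step and `average_step_cost(1, 200, false)` gives 6767. In this model the cached cost still grows linearly with position, which is why an average over the whole run is a reasonable single number to report.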
Thanks for the insight. Also, very impressive work.