Is it the case that a larger prompt takes longer to produce the first token?
Overview
I am using the llama-7b Q4_K_M (GGUF) model for a simple chat-over-docs application, which involves fetching relevant chunks and stuffing them into the prompt.
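For context, this is roughly what the setup looks like. It's a sketch using llama-cpp-python; the model path, retriever, and helper names are placeholders for illustration, not my exact code:

```python
from llama_cpp import Llama

# Placeholder path/settings for illustration, not my exact config.
llm = Llama(
    model_path="./models/llama-7b.Q4_K_M.gguf",  # 4-bit K_M quant, GGUF
    n_ctx=4096,        # context window big enough for ~2100 chunk tokens + question
    n_gpu_layers=-1,   # offload all layers to the GPU
)

def build_prompt(question: str, chunks: list[str]) -> str:
    """Stuff the retrieved chunks directly into the prompt."""
    context = "\n\n".join(chunks)  # 3 chunks of ~700 tokens each
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```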
Problem
Each chunk is around 700 tokens and I fetch 3 of them, so the prompt ends up fairly large (roughly 2,100 tokens plus the question). When I run inference with streaming, it takes about 2-3 minutes (with a GPU!) just to get started.
Once the response begins, it completes within 4-5 seconds, so the generation speed itself is acceptable.
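This is roughly how I'm measuring the delay, i.e. time to first token vs. the rest of the stream (a sketch assuming the llama-cpp-python streaming API and the `llm` / `build_prompt` placeholders from above; `chunks` is whatever the retriever returned):

```python
import time

prompt = build_prompt("What does the document say about X?", chunks)

start = time.perf_counter()
first_token_at = None
n_tokens = 0

# stream=True yields completion chunks as they are generated
for chunk in llm(prompt, max_tokens=256, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()  # this is the 2-3 minute wait
    n_tokens += 1
    print(chunk["choices"][0]["text"], end="", flush=True)

end = time.perf_counter()
print(f"\ntime to first token: {first_token_at - start:.1f}s")
print(f"generation after first token: {end - first_token_at:.1f}s for {n_tokens} tokens")
```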
I am only looking for an explanation of why it takes so long to get started (the time to first token).
On the other hand, if the prompt is small, e.g. just "AI is going to", generation starts within a second, which suggests that the larger the prompt, the longer it takes to get started.
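A quick experiment along those lines (again a sketch, reusing the placeholder `llm`, `build_prompt`, and `chunks` from above):

```python
def time_to_first_token(prompt: str) -> float:
    """Return seconds until the first streamed token arrives."""
    start = time.perf_counter()
    for _ in llm(prompt, max_tokens=1, stream=True):
        return time.perf_counter() - start
    return float("inf")

# The short prompt starts almost immediately; the stuffed RAG prompt does not.
print("short prompt:", time_to_first_token("AI is going to"))
print("RAG prompt  :", time_to_first_token(build_prompt("What does the doc say about X?", chunks)))
```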
But the same llama model loaded in 8-bit and used with Transformers does not take this long for the same RAG application; the delay only happens with the GGUF format.
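For comparison, the Transformers 8-bit setup looks roughly like this (a sketch; the checkpoint name and generation parameters are assumptions, not my exact config, and `prompt` is the same stuffed prompt as above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "huggyllama/llama-7b"  # assumed checkpoint name for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit via bitsandbytes
    device_map="auto",
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```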
Please help 🙏🏻