GPTQ-for-LLaMa
Lonnnnnnnnng context load time before generation
I'm running llama 65b on dual 3090s, and at longer contexts I'm noticing seriously long context load times (the delay between sending a prompt and tokens actually being streamed back). It seems my CPU is only using a single core and maxing it out at 100%... Is there something it's doing that's heavily serialized? ... Any way to parallelize the workflow?
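For reference, a minimal sketch of one way to check whether PyTorch is capped at a single CPU thread (this is just a guess at the bottleneck; the slow phase may not be CPU-bound at all):

```python
import os
import torch

# Inspect how many intra-op / inter-op threads PyTorch is currently using.
print("intra-op threads:", torch.get_num_threads())
print("inter-op threads:", torch.get_num_interop_threads())

# Optionally raise the intra-op thread count to the number of available cores.
# This only helps if the serialized work actually runs on the CPU.
torch.set_num_threads(os.cpu_count())
print("intra-op threads now:", torch.get_num_threads())
```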
What code did you run?
I would like to confirm this issue as well. It really becomes noticeable when running chat vs normal/notebook mode. Chat with no context set runs really fast, but once you start adding context, start-up speed takes a nosedive.
4bit 65b on my A6000
In the case of llama.cpp, when a long prompt is given you can see it output the provided prompt word by word at a slow rate even before it starts generating anything new. It's directly evident that it takes longer to get through larger prompts. I guess a similar thing is happening here.
So I compared bitsandbytes 8-bit and GPTQ 8-bit, and GPTQ was the only one that had a start delay. Something is causing a delay before anything starts generating.
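A rough way to compare that start delay is to time the prompt-ingestion phase separately from steady-state generation, e.g. by generating a single token first and then a longer continuation. The sketch below uses the plain transformers `generate` API with a placeholder checkpoint path, not the GPTQ-for-LLaMa loading code, so treat it as an illustration of the measurement rather than the exact setup discussed here:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/llama-65b"  # placeholder: point this at your local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Some long context. " * 500  # long prompt to expose the start delay
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Time to first token: dominated by prompt ingestion (the "context load" phase).
t0 = time.perf_counter()
model.generate(**inputs, max_new_tokens=1)
t_first = time.perf_counter() - t0
print(f"time to first token: {t_first:.2f}s")

# Longer run: the extra time over t_first approximates pure generation speed.
t0 = time.perf_counter()
model.generate(**inputs, max_new_tokens=64)
t_long = time.perf_counter() - t0
print(f"~{63 / max(t_long - t_first, 1e-6):.2f} tok/s once generation starts")
```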
Runs pretty well once it starts... not sure if it's loading something or reading layers before inferencing. It's definitely got the quirks of new tech; it might just be a case of "well, that's how it works".
Probably fixed now, see #30.
I think this issue has been resolved.