GPTQ-for-LLaMa

Lonnnnnnnnng context load time before generation

generic-username0718 opened this issue 2 years ago · 5 comments

I'm running LLaMA 65B on dual 3090s, and at longer contexts I'm noticing seriously long context load times (the time between sending a prompt and tokens actually being received/streamed). It seems my CPU is only using a single core and maxing it out at 100%... Is there something it's doing that's heavily serialized? Any way to parallelize the workflow?
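For what it's worth, a minimal sketch (standard PyTorch calls, nothing specific to this repo) for checking how many CPU threads PyTorch is allowed to use. Note this only helps if CPU-bound tensor ops are the bottleneck; a single pegged core can also just be the Python dispatch loop driving the GPUs, which raising the thread count won't fix.

```python
import os
import torch

# How many threads PyTorch currently uses for CPU work.
print("intra-op threads:", torch.get_num_threads())
print("inter-op threads:", torch.get_num_interop_threads())

# Allow CPU-bound ops to use all available cores.
torch.set_num_threads(os.cpu_count() or 1)
```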

generic-username0718 avatar Mar 13 '23 04:03 generic-username0718

What code did you run?

qwopqwop200 avatar Mar 13 '23 05:03 qwopqwop200

I would like to confirm this issue as well. It really becomes noticeable when running chat mode vs normal/notebook mode. Chat with no context set runs really fast, but once you start adding context, start-up speed takes a nosedive.

4-bit 65B on my A6000

USBhost avatar Mar 13 '23 23:03 USBhost

In the case of llama.cpp, when a long prompt is given you can see it output the provided prompt word by word at a slow rate even before it starts generating anything new. It's directly evident that it takes longer to get through larger prompts. I guess a similar thing is happening here.
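A minimal sketch (standard transformers/PyTorch calls, model loading elided) of how you could confirm this: time the prompt-evaluation ("prefill") forward pass on its own and watch it grow with prompt length, since that pass happens before the first new token appears.

```python
import time
import torch

def time_prefill(model, tokenizer, prompt, device="cuda"):
    """Time a single forward pass over the whole prompt (the pre-generation delay)."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        model(ids, use_cache=True)  # one pass over all prompt tokens
    torch.cuda.synchronize()
    return ids.shape[1], time.time() - start

# e.g. compare a short and a long prompt:
# for p in (short_prompt, long_prompt):
#     n, t = time_prefill(model, tokenizer, p)
#     print(f"{n} prompt tokens -> {t:.2f}s before any new token")
```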

plhosk avatar Mar 14 '23 05:03 plhosk

So I compared bitsandbytes 8-bit and GPTQ 8-bit, and GPTQ was the only one with a start delay. Something is causing a delay before anything starts generating.
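A rough sketch (hypothetical helper, standard transformers/PyTorch calls) of one way to make that comparison concrete: measure time-to-first-token for each backend by generating a single token, which includes the prefill plus one decode step.

```python
import time
import torch

def time_to_first_token(model, tokenizer, prompt, device="cuda"):
    """Return seconds from prompt submission to the first generated token."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        model.generate(ids, max_new_tokens=1)  # prefill + one decoded token
    torch.cuda.synchronize()
    return time.time() - start

# Call this once with a model loaded via bitsandbytes 8-bit and once with a
# GPTQ-quantized checkpoint, using the same long prompt, and compare the numbers.
```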

USBhost avatar Mar 20 '23 03:03 USBhost

It runs pretty well once it starts... not sure if it's loading something, reading layers, then inferencing. It's definitely got the quirks of new tech; it might just be a case of "well, that's how it works."

Digitous avatar Mar 20 '23 21:03 Digitous

Probably fixed now, see #30.

aljungberg avatar Mar 29 '23 09:03 aljungberg

I think this issue has been resolved.

qwopqwop200 avatar Apr 02 '23 02:04 qwopqwop200