KoboldAI
GPTQ models load much slower than 0cc4m's fork used to
The slow loading happens every time a new chat message is generated; it is particularly noticeable with 30B models, but also noticeable with 13B.
There are currently two issues going on:
- Occam's GPTQ fork is not fully compatible with newer Hugging Face builds, so KoboldAI falls back to AutoGPTQ in more cases than before.
- AutoGPTQ itself has an outstanding issue where slow loading occurs on some systems, which they still need to address.
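To illustrate the first point, here is a minimal sketch of what that backend selection looks like in principle: try the fast fork first and drop down to AutoGPTQ when it fails to load. The function and module names are illustrative stand-ins, not KoboldAI's actual API.

```python
# Hypothetical sketch of the fallback described above: prefer the
# fork's fast GPTQ loader, fall back to AutoGPTQ on incompatibility.
# Names below are placeholders, not KoboldAI's real loader functions.

def load_with_fork(model_path: str):
    """Stand-in for Occam's GPTQ loader; raises when the installed
    huggingface build is incompatible with the fork."""
    raise ImportError("fork incompatible with this transformers build")

def load_with_autogptq(model_path: str):
    """Stand-in for the AutoGPTQ loader (slower on some systems)."""
    return f"autogptq::{model_path}"

def load_gptq_model(model_path: str):
    # Any compatibility failure in the fork silently drops us down to
    # AutoGPTQ, which is why loads got slower without an obvious error.
    try:
        return load_with_fork(model_path)
    except ImportError:
        return load_with_autogptq(model_path)
```

With an incompatible fork, every load quietly takes the AutoGPTQ path, matching the behavior reported here.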
So on the Kobold side we are waiting for either GPTQ package to get the needed updates before full speed is restored.