Forkoz


Hard to say. When using multiple cores for the parts that did use them, it was often slower. But people report CPU bottlenecks all over the place.

It has been sucking for autosplit in my experience lately too. A lot of models have to be done manually. I think there is a parameter to set reserved memory...

This is me too. And I have 24 GB of VRAM and 96 GB of system RAM.. I am not out of RAM.

I converted a model to 8-bit using AutoGPTQ and now it works both here and in the tool that converted it.. so it's the quantization code that is broken, not inference.

Yeah.. download and install AutoGPTQ, then use this script: https://github.com/PanQiWei/AutoGPTQ/blob/main/examples/quantization/quant_with_alpaca.py
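
Roughly, that script boils down to something like the sketch below (the model paths here are placeholders, and the real script feeds an Alpaca calibration set instead of the dummy example):

```python
# Minimal sketch of the AutoGPTQ quantization flow, based on the AutoGPTQ examples.
# "my-model" / "my-model-gptq" are placeholder paths; a real run needs a proper
# calibration dataset (the linked script uses Alpaca prompts).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_dir = "my-model"      # placeholder: FP16/FP32 source checkpoint
quantized_dir = "my-model-gptq"  # placeholder: output directory

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantization (use bits=8 for 8-bit)
    group_size=128,  # common GPTQ group size
    desc_act=False,  # act-order off: faster inference, slightly lower quality
)

tokenizer = AutoTokenizer.from_pretrained(pretrained_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_pretrained(pretrained_dir, quantize_config)

# Calibration examples: the real script tokenizes Alpaca data here.
examples = [tokenizer("This is a placeholder calibration sentence.")]

model.quantize(examples)
model.save_quantized(quantized_dir)
tokenizer.save_pretrained(quantized_dir)
```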

Yes.. 6b would work great for 13b and below to make the model smarter.

Write a Python script to convert from FP32 to FP16.. don't use GPTQ.
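
A minimal sketch of that kind of script, assuming a Hugging Face Transformers checkpoint; "my-fp32-model" and "my-fp16-model" are placeholder paths:

```python
# Load an FP32 checkpoint directly in half precision, then re-save it as FP16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "my-fp32-model"  # placeholder: original FP32 checkpoint
dst = "my-fp16-model"  # placeholder: output directory

model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(src)

model.save_pretrained(dst)
tokenizer.save_pretrained(dst)
```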

Does this work through HIP on ROCm?

Triton won't support us. They "fixed" it by adding some warnings and asserts. It is not this repo's fault. The ooba branch/AutoGPTQ/my fork work for fast inference. The "faster" kernel...

Old CUDA with the faster kernel disabled is the way to go.