Forkoz


Hard to say. When using multiple cores for the parts that did use them, it was often slower. But people report CPU bottlenecks all over the place.

It has been sucking for autosplit in my experience lately too. A lot of models have to be done manually. I think there is a parameter to set reserved memory...

This is me too. And I have 24 GB of VRAM and 96 GB of system RAM.. I am not out of RAM.

I converted a model to 8-bit using AutoGPTQ and now it works both here and in the tool that converted it.. so it's the quantization code that is broken, not inference.

Yeah.. download and install AutoGPTQ, then use this script: https://github.com/PanQiWei/AutoGPTQ/blob/main/examples/quantization/quant_with_alpaca.py
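
Roughly, that script boils down to something like the sketch below (the model paths here are placeholders, and the real script feeds an Alpaca calibration set instead of the dummy example):

```python
# Minimal sketch of the AutoGPTQ quantization flow, based on the AutoGPTQ examples.
# "my-model" / "my-model-gptq" are placeholder paths; a real run needs a proper
# calibration dataset (the linked script uses Alpaca prompts).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_dir = "my-model"      # placeholder: FP16/FP32 source checkpoint
quantized_dir = "my-model-gptq"  # placeholder: output directory

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantization (use bits=8 for 8-bit)
    group_size=128,  # common GPTQ group size
    desc_act=False,  # act-order off: faster inference, slightly lower quality
)

tokenizer = AutoTokenizer.from_pretrained(pretrained_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_pretrained(pretrained_dir, quantize_config)

# Calibration examples: the real script tokenizes Alpaca data here.
examples = [tokenizer("This is a placeholder calibration sentence.")]

model.quantize(examples)
model.save_quantized(quantized_dir)
tokenizer.save_pretrained(quantized_dir)
```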

Yes.. 6b would work great for 13b and below to make the model smarter.

Write a Python script to convert from FP32 to FP16.. don't use GPTQ.
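
A minimal sketch of that kind of script, assuming a Hugging Face Transformers checkpoint; "my-fp32-model" and "my-fp16-model" are placeholder paths:

```python
# Load an FP32 checkpoint directly in half precision, then re-save it as FP16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "my-fp32-model"  # placeholder: original FP32 checkpoint
dst = "my-fp16-model"  # placeholder: output directory

model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(src)

model.save_pretrained(dst)
tokenizer.save_pretrained(dst)
```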

Does this work through HIP on ROCm?

Triton won't support us. They "fixed" it by adding some warnings and asserts. It is not this repo's fault. The ooba branch/AutoGPTQ/my fork work for fast inference. The "faster" kernel...

Old CUDA with the faster kernel disabled is the way to go.