GPTQ-for-LLaMa
Running on an old GPU with FP32 only
I'm using an old P40, which does not seem to support FP16.
I tried the latest triton branch and compiled Triton from master.
The inference code shows errors like:
error: invalid element type in packLLEElements. Expected 'f32' but got 'f16'
error: 'llvm.intr.fmuladd' op requires the same type for all operands and results
Pass execution failed
LLVM ERROR: Failed to translate TritonGPU to LLVM IR.
I tried replacing all float16 with float32, and the model loads, but:
triton.compiler.errors.CompilationError: at 58:33:
zeros = (zeros >> zeros_shifter[None, :]) & maxq
zeros = (zeros + 1)
a = tl.load(a_ptrs, mask=a_mask, other=0.) # (BLOCK_SIZE_M, BLOCK_SIZE_K)
b = tl.load(b_ptrs) # (BLOCK_SIZE_K, BLOCK_SIZE_N), but repeated
# Now we need to unpack b (which is N-bit values) into 32-bit values
b = (b >> shifter[:, None]) & maxq # Extract the N-bit values
b = (b - zeros) * scales # Scale and shift
accumulator += tl.dot(a, b)
^
AssertionError('lhs and rhs must have the same dtype!')
Any idea how to fix this?
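For reference, here is a minimal sketch I put together of what I think the assert is complaining about: both operands of tl.dot have to share a dtype. This is a plain matmul, not the repo's dequantization kernel; the names are my own, it is untested on a P40, and it assumes M, N and K are multiples of the block sizes (no masking):

```python
import torch
import triton
import triton.language as tl


@triton.jit
def matmul_fp32_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                       stride_am, stride_ak,
                       stride_bk, stride_bn,
                       stride_cm, stride_cn,
                       BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                       BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)  # assumes no partial tiles, so no mask
        b = tl.load(b_ptrs)
        # Cast both operands to fp32 so lhs/rhs dtypes always match; on a
        # card without usable FP16 this is the only path that compiles anyway.
        acc += tl.dot(a.to(tl.float32), b.to(tl.float32))
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)


def matmul_fp32(a, b):
    # Host-side launcher for the sketch above; block sizes are arbitrary.
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    matmul_fp32_kernel[grid](a, b, c, M, N, K,
                             a.stride(0), a.stride(1),
                             b.stride(0), b.stride(1),
                             c.stride(0), c.stride(1),
                             BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c
```

In the quantized kernel from the traceback, the equivalent change would be making sure `a`, `scales` and `zeros` all end up in the same dtype before `(b - zeros) * scales` feeds into `tl.dot`, but I have not verified that on this hardware.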
I found there are several branches here:
triton: raises the problem above
cuda: works, but quite slow
old-cuda: works, but still slow and gives weird results
Triton won't support us. They "fixed" it by adding some warnings and asserts. It is not this repo's fault.
The ooba branch, AutoGPTQ, and my fork all work for fast inference. The "faster" kernel that uses FP16 has to be turned off. On Pascal, FP16 runs at half speed, while on a 3090 FP16 and FP32 run at equal speed.
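As a rough illustration of the non-Triton route (my own sketch, not code from this repo; the model path is a placeholder, and the arguments should be checked against your AutoGPTQ version), loading with the CUDA kernels looks roughly like this:

```python
import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_dir = "path/to/llama-7b-4bit-128g"  # hypothetical local quantized checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)

# use_triton=False keeps inference on the CUDA kernels, which sidesteps
# the Triton compile errors on Pascal cards entirely.
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_triton=False,
)

inputs = tokenizer("Hello", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```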
Unrelated to the card, however: is old-cuda still considered the fastest? I'm running a 1080 Ti, but I doubt that matters in this case.
Old-cuda with the faster kernel disabled is the way to go.
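If you are not sure whether your card is in the same boat, a quick capability check along these lines (my own snippet, not from this repo) will tell you; Pascal consumer cards like the P40 and 1080 Ti report compute capability 6.1:

```python
import torch


def prefers_fp32(device: int = 0) -> bool:
    """Heuristic: treat anything below compute capability 7.0 (Volta)
    as a card where the FP16 path is not worth using."""
    major, _minor = torch.cuda.get_device_capability(device)
    return major < 7


if __name__ == "__main__":
    name = torch.cuda.get_device_name(0)
    print(name, "-> use FP32" if prefers_fp32() else "-> FP16 is fine")
```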