exllama
Working with TheBloke/WizardLM-30B-Uncensored-GPTQ
Hi! I got this to work with TheBloke/WizardLM-30B-Uncensored-GPTQ.
Here's what worked:
1. This doesn't work on Windows, but it does work on WSL.
2. Download the model (and all files) from HF and place it somewhere inside the WSL Linux filesystem, not under /mnt/c/somewhere, otherwise the model loading will be mega slow regardless of your disk speed.
3. In model.py I added the following:
# self.groupsize = (self.qweight.shape[0] * 8) // self.qzeros.shape[0]
self.groupsize = None
self.config.groupsize = None
self.config.act_order = True
# self.config.groupsize = self.groupsize
# if self.config.groupsize is None:
# self.config.groupsize = self.groupsize
# else:
# if self.config.groupsize != self.groupsize:
# raise ValueError("Irregular groupsize for matrix: " + key + ", " + str(self.config.groupsize) + ", "+ str(self.groupsize))
Note the commented-out code and the additions; the sketch after this list shows why the shape-based groupsize inference misbehaves on this model.
4. I had to use -mm pytorch_only and -a pytorch_matmul
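For context on why that override helps, here is a minimal sketch of what the commented-out shape-based inference does, assuming the usual GPTQ packing of eight 4-bit weights per int32 row. The shapes below are illustrative, not read from this checkpoint.

# Sketch only: assumed shapes, not taken from the actual model files.
# qweight packs eight 4-bit weights per int32 row, so a layer with
# in_features = 6656 has qweight.shape[0] == 832. qzeros has one row per
# quantization group, so a model quantized WITHOUT a groupsize has a single
# qzeros row, and the "inferred groupsize" comes out as the layer's full
# input width -- a number that differs from layer to layer, which is what
# trips the "Irregular groupsize for matrix" check that the override disables.

def infer_groupsize(qweight_rows: int, qzeros_rows: int) -> int:
    return (qweight_rows * 8) // qzeros_rows

print(infer_groupsize(832, 52))  # 128  -> model quantized with groupsize 128
print(infer_groupsize(832, 1))   # 6656 -> no groupsize; equals in_features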
That's helpful. I'll look into it. Probably just yet another variant of GPTQ to consider.
I pushed an update now to deal with weights without groupsize. Seems to work here at least, also with the quantized matmul to give 33 tokens/second on my setup. So you wouldn't need -mm pytorch_only or to disable groupsize detection. Curious to hear if it did the trick on Windows as well.
Also, are you sure you needed the -a pytorch_matmul argument? Cause that would be a completely separate issue if scaled_dot_product_attention() doesn't work, since attention doesn't touch the quantized matrices at all. Maybe you're on PyTorch 1.x?
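For reference, a quick way to check whether that attention path is even available is to look for scaled_dot_product_attention(), which only exists from PyTorch 2.0 onward. A minimal check, run in the same environment:

import torch

# True on PyTorch 2.0+, False on 1.x (which would explain needing the attention override)
print(hasattr(torch.nn.functional, "scaled_dot_product_attention"))
print(torch.__version__)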
Curious to hear if it did the trick on Windows as well
Windows couldn't get past the CUDA extension compilation. It's complaining that it can't find python310.lib (I'm using a venv). If I copy python310.lib into place, it then fails to link against cublas. I don't particularly care, since my WSL install is working fine.
Partial error output:
Creating library exllama_ext.lib and object exllama_ext.exp
half_matmul.cuda.o : error LNK2019: unresolved external symbol cublasHgemm referenced in function "enum cudaError __cdecl half_matmul_cublas_cuda(struct __half const *,struct __half const *,struct __half *,int,int,int,struct cublasContext *)" (?half_matmul_cublas_cuda@@YA?AW4cudaError@@PEBU__half@@0PEAU2@HHHPEAUcublasContext@@@Z)
exllama_ext.pyd : fatal error LNK1120: 1 unresolved externals
ninja: build stopped: subcommand failed.
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:36:15_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
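For anyone hitting the same unresolved cublasHgemm symbol on Windows: if the extension is built at runtime with torch.utils.cpp_extension.load(), passing the CUDA lib directory and cublas.lib to the linker explicitly will usually resolve it. This is only a sketch; the source file names and the CUDA_PATH layout below are assumptions, not the project's actual build configuration.

import os
from torch.utils.cpp_extension import load

# Assumes the CUDA installer set CUDA_PATH; adjust for your install.
cuda_lib_dir = os.path.join(os.environ["CUDA_PATH"], "lib", "x64")

exllama_ext = load(
    name="exllama_ext",
    sources=["exllama_ext/exllama_ext.cpp", "exllama_ext/half_matmul.cu"],  # placeholder file names
    extra_ldflags=[f"/LIBPATH:{cuda_lib_dir}", "cublas.lib"],  # MSVC linker flags: add CUDA lib dir and link cublas
    verbose=True,
)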
Also, are you sure you needed the -a pytorch_matmul argument?
Nope, I just monkeyed around with flags I didn't understand until it started working, then I stopped - I bet the force-overrides I did in the code (e.g. act_order) were not needed either.
I pushed an update now to deal with weights without groupsize
With the new update, I don't need any flags, and everything works and is much faster. Here are my results:
System info:
Ryzen 9 7950X3D
64GB DDR5 6000
RTX 4090
Benchmark results:
-- Sequence length: 2048
-- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'perf', 'perplexity']
** Time, Load model: 4.36 seconds
-- Groupsize (inferred): None
-- Act-order (inferred): no
** VRAM, Model: [cuda:0] 15,936.29 MB
-- Inference, first pass.
** Time, Inference: 1.97 seconds
** Speed: 976.28 tokens/second
-- Generating 128 tokens...
** Speed: 26.38 tokens/second
** VRAM, Inference: [cuda:0] 4,014.11 MB
** VRAM, Total: [cuda:0] 19,950.40 MB
-- Loading dataset...
-- Testing..........
** Perplexity: 5.7553
I'm not sure how to get the versions of the dependencies it's using (I'm not a Python guy), but looking at my venv folder I have Torch 2.1.0.dev20230523+cu118 and CUDA 11.8.
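As a side note on the version question: PyTorch can report its own build info, so a minimal check run inside the same venv would be:

import torch

print(torch.__version__)              # e.g. 2.1.0.dev20230523+cu118
print(torch.version.cuda)             # CUDA version the wheel was built against, e.g. 11.8
print(torch.cuda.get_device_name(0))  # should show the RTX 4090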
I should be getting more than that, right?
This issue seems to have been forgotten, but yeah, you should be getting better speeds than that. A lot has changed in the last three weeks, so you could try with the latest version. Also, the Windows-native version seems to be faster, at least for some people, when using Hardware Accelerated GPU Scheduling, so that could be a thing to try.