exllama
Working with TheBloke/WizardLM-30B-Uncensored-GPTQ
Hi! I got this to work with TheBloke/WizardLM-30B-Uncensored-GPTQ.
Here's what worked:
1. This doesn't work on Windows, but it does work on WSL.
2. Download the model (and all files) from HF and place it somewhere inside the WSL Linux filesystem, not under /mnt/c/somewhere, otherwise the model loading will be mega slow regardless of your disk speed.
3. In model.py I added the following:
# self.groupsize = (self.qweight.shape[0] * 8) // self.qzeros.shape[0]
self.groupsize = None
self.config.groupsize = None
self.config.act_order = True
# self.config.groupsize = self.groupsize
# if self.config.groupsize is None:
# self.config.groupsize = self.groupsize
# else:
# if self.config.groupsize != self.groupsize:
# raise ValueError("Irregular groupsize for matrix: " + key + ", " + str(self.config.groupsize) + ", "+ str(self.groupsize))
Note the commented-out code and the additions; the sketch after this list shows why the shape-based groupsize inference misbehaves on this model.
4. I had to use -mm pytorch_only and -a pytorch_matmul
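For context on why that override helps, here is a minimal sketch of what the commented-out shape-based inference does, assuming the usual GPTQ packing of eight 4-bit weights per int32 row. The shapes below are illustrative, not read from this checkpoint.

# Sketch only: assumed shapes, not taken from the actual model files.
# qweight packs eight 4-bit weights per int32 row, so a layer with
# in_features = 6656 has qweight.shape[0] == 832. qzeros has one row per
# quantization group, so a model quantized WITHOUT a groupsize has a single
# qzeros row, and the "inferred groupsize" comes out as the layer's full
# input width -- a number that differs from layer to layer, which is what
# trips the "Irregular groupsize for matrix" check that the override disables.

def infer_groupsize(qweight_rows: int, qzeros_rows: int) -> int:
    return (qweight_rows * 8) // qzeros_rows

print(infer_groupsize(832, 52))  # 128  -> model quantized with groupsize 128
print(infer_groupsize(832, 1))   # 6656 -> no groupsize; equals in_features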
That's helpful. I'll look into it. Probably just yet another variant of GPTQ to consider.
I pushed an update now to deal with weights without groupsize. Seems to work here at least, also with the quantized matmul to give 33 tokens/second on my setup. So you wouldn't need -mm pytorch_only or to disable groupsize detection. Curious to hear if it did the trick on Windows as well.
Also, are you sure you needed the -a pytorch_matmul argument? Cause that would be a completely separate issue if scaled_dot_product_attention() doesn't work, since attention doesn't touch the quantized matrices at all. Maybe you're on PyTorch 1.x?
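For reference, a quick way to check whether that attention path is even available is to look for scaled_dot_product_attention(), which only exists from PyTorch 2.0 onward. A minimal check, run in the same environment:

import torch

# True on PyTorch 2.0+, False on 1.x (which would explain needing the attention override)
print(hasattr(torch.nn.functional, "scaled_dot_product_attention"))
print(torch.__version__)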
Curious to hear if it did the trick on Windows as well
Windows couldn't get past the CUDA extension compilation. It's complaining that it can't find python310.lib (I'm using a venv). If I copy python310.lib into place, it then fails to link against cublas. I don't particularly care, since my WSL install is working fine.
Partial error output:
Creating library exllama_ext.lib and object exllama_ext.exp
half_matmul.cuda.o : error LNK2019: unresolved external symbol cublasHgemm referenced in function "enum cudaError __cdecl half_matmul_cublas_cuda(struct __half const *,struct __half const *,struct __half *,int,int,int,struct cublasContext *)" (?half_matmul_cublas_cuda@@YA?AW4cudaError@@PEBU__half@@0PEAU2@HHHPEAUcublasContext@@@Z)
exllama_ext.pyd : fatal error LNK1120: 1 unresolved externals
ninja: build stopped: subcommand failed.
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:36:15_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
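For anyone hitting the same unresolved cublasHgemm symbol on Windows: if the extension is built at runtime with torch.utils.cpp_extension.load(), passing the CUDA lib directory and cublas.lib to the linker explicitly will usually resolve it. This is only a sketch; the source file names and the CUDA_PATH layout below are assumptions, not the project's actual build configuration.

import os
from torch.utils.cpp_extension import load

# Assumes the CUDA installer set CUDA_PATH; adjust for your install.
cuda_lib_dir = os.path.join(os.environ["CUDA_PATH"], "lib", "x64")

exllama_ext = load(
    name="exllama_ext",
    sources=["exllama_ext/exllama_ext.cpp", "exllama_ext/half_matmul.cu"],  # placeholder file names
    extra_ldflags=[f"/LIBPATH:{cuda_lib_dir}", "cublas.lib"],  # MSVC linker flags: add CUDA lib dir and link cublas
    verbose=True,
)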
Also, are you sure you needed the -a pytorch_matmul argument?
Nope, I just monkeyed around with flags I didn't understand until it started working, then I stopped - I bet the force-overrides I did in the code (e.g. act_order) were not needed either.
I pushed an update now to deal with weights without groupsize
With the new update, I don't need any flags, and everything works and is much faster. Here are my results:
System info:
Ryzen 9 7950X3D
64GB DDR5 6000
RTX 4090
Benchmark results:
-- Sequence length: 2048
-- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'perf', 'perplexity']
** Time, Load model: 4.36 seconds
-- Groupsize (inferred): None
-- Act-order (inferred): no
** VRAM, Model: [cuda:0] 15,936.29 MB
-- Inference, first pass.
** Time, Inference: 1.97 seconds
** Speed: 976.28 tokens/second
-- Generating 128 tokens...
** Speed: 26.38 tokens/second
** VRAM, Inference: [cuda:0] 4,014.11 MB
** VRAM, Total: [cuda:0] 19,950.40 MB
-- Loading dataset...
-- Testing..........
** Perplexity: 5.7553
I'm not sure how to get the versions of the dependencies it's using (I'm not a Python guy), but looking at my venv folder I have Torch 2.1.0.dev20230523+cu118 and CUDA 11.8.
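As a side note on the version question: PyTorch can report its own build info, so a minimal check run inside the same venv would be:

import torch

print(torch.__version__)              # e.g. 2.1.0.dev20230523+cu118
print(torch.version.cuda)             # CUDA version the wheel was built against, e.g. 11.8
print(torch.cuda.get_device_name(0))  # should show the RTX 4090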
I should be getting more than that, right?
This issue seems to have been forgotten, but yeah, you should be getting better speeds than that. A lot has changed in the last three weeks, so you could try with the latest version. Also, the Windows-native version seems to be faster, at least for some people, when using Hardware Accelerated GPU Scheduling, so that could be a thing to try.