
CUDA vs Triton on an RTX 3060 12GB

1aienthusiast opened this issue 1 year ago

CUDA: 35 tokens/s, Triton: 5 tokens/s
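
For reference, this is roughly how I computed the tokens/s numbers; a minimal sketch assuming a Hugging Face-style model and tokenizer are already loaded (`model` and `tokenizer` are placeholders, not this repo's actual API):

```python
import time

# Rough tokens/s measurement (sketch): time one generate() call and
# divide the number of newly generated tokens by wall-clock seconds.
# `model` and `tokenizer` are assumed to be loaded already.
def tokens_per_second(model, tokenizer, prompt, max_new_tokens=128):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens / elapsed
```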

I used ooba's webui only for CUDA, because I've been unable to get Triton to work with ooba's webui; I made sure I used the same parameters as in the Triton command. This is the error I get when trying to load the Triton model in the webui:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /home/username/miniconda3/envs/textgen/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/username/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Loading llama-7b-4bit-triton...
Traceback (most recent call last):
  File "/home/username/AI/2oobabooga/text-generation-webui/server.py", line 275, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/username/AI/2oobabooga/text-generation-webui/modules/models.py", line 102, in load_model
    model = load_quantized(model_name)
  File "/home/username/AI/2oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 114, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, kernel_switch_threshold=threshold)
  File "/home/username/AI/2oobabooga/text-generation-webui/modules/GPTQ_loader.py", line 36, in _load_quant
    make_quant(model, layers, wbits, groupsize, faster=faster_kernel, kernel_switch_threshold=kernel_switch_threshold)
TypeError: make_quant() got an unexpected keyword argument 'faster'
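
I suspect this is just a signature mismatch: the webui's GPTQ_loader passes faster= and kernel_switch_threshold=, which this repo's make_quant() apparently does not accept. A hypothetical, untested shim that drops unsupported keyword arguments before calling it:

```python
import inspect

# Hypothetical workaround (untested): filter out keyword arguments
# that the installed make_quant() does not declare, so the webui's
# CUDA-oriented call site doesn't crash against this repo's version.
def make_quant_compat(make_quant, *args, **kwargs):
    accepted = inspect.signature(make_quant).parameters
    filtered = {k: v for k, v in kwargs.items() if k in accepted}
    return make_quant(*args, **filtered)
```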

For Triton I used this command:

python3.10 generate.py --model ./ --quant --prompt "Write a story about a duck: Once upon a time there was a duck" --temperature 1.99 --top-p 0.18 --repetition-penalty 1.15 --max-length 128

I used the 7B-4bit model (I quantized it for Triton using python3.10 convert_weights.py --quant ~/AI/2oobabooga/text-generation-webui/models/llama-7b-4bit/llama-7b-4bit.safetensors --model ~/AI/oobabooga/text-generation-webui/models/LLaMA-7B/ --output ./).
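
As a quick sanity check that the conversion produced a usable checkpoint, something like this can list the quantized tensors; the filename model.safetensors is my assumption, so adjust it to whatever convert_weights.py actually wrote:

```python
from safetensors.torch import load_file

# Inspect the converted checkpoint (filename is an assumption):
# print a few tensor names, dtypes, and shapes to confirm the
# packed 4-bit layout made it into the file.
state = load_file("model.safetensors")
for name, tensor in list(state.items())[:8]:
    print(name, tensor.dtype, tuple(tensor.shape))
```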

GPU: RTX 3060 12GB
OS: Debian

1aienthusiast, Apr 02 '23