qwopqwop200

86 comments of qwopqwop200

Please see https://github.com/zyddnys/manga-image-translator/issues/25 for information about the dataset. The dataset is about 3.7 TB in total (danbooru2020: 3.4 TB, ImageNet: 150 GB, COCO: 81 GB, Manga109: 6 GB). Also, if you compress and store the file in...

I don't have a deep understanding of the GPTQ quantization algorithm myself. However, for the GPTQ quantization itself, refer to the following: [GPTQ](https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/main/gptq.py), [Quantizer](https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/main/quant.py#L10), and [llama_sequential](https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/main/llama.py#L24). For model inference, you can refer to [vecquant4matmul_cuda](https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/main/quant_cuda_kernel.cu#L264) and [VecQuant4MatMulKernel](https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/main/quant_cuda_kernel.cu#L295), as well as [QuantLinear](https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/main/quant.py#L128)...
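For a rough sense of what the inference side above works on: QuantLinear keeps the weights packed eight 4-bit values per 32-bit integer, together with scales and zero points, and the kernel unpacks and dequantizes them on the fly. A minimal sketch of that idea (my own simplification, not the actual kernel layout; I use int64 here so the packing stays trivial):

```
import torch

def pack_4bit(w_int: torch.Tensor) -> torch.Tensor:
    """Pack 4-bit values (0..15), shape (rows, cols) with rows % 8 == 0,
    into one integer per eight rows, low nibble first (simplified layout)."""
    rows, cols = w_int.shape
    packed = torch.zeros(rows // 8, cols, dtype=torch.int64)
    for i in range(8):
        packed |= (w_int[i::8].to(torch.int64) & 0xF) << (4 * i)
    return packed  # the real kernels store int32; int64 keeps this sketch simple

def unpack_and_dequant(qweight: torch.Tensor, scales: torch.Tensor, zeros: torch.Tensor):
    """Inverse of pack_4bit, followed by affine dequantization w = scale * (q - zero)."""
    rows = qweight.shape[0] * 8
    w_int = torch.empty(rows, qweight.shape[1], dtype=torch.int64)
    for i in range(8):
        w_int[i::8] = (qweight >> (4 * i)) & 0xF
    return scales * (w_int.float() - zeros)

# Round-trip check on random data.
w_int = torch.randint(0, 16, (16, 4))
scales = torch.rand(1, 4)        # per-column scale
zeros = torch.full((1, 4), 8.0)  # per-column zero point
w = unpack_and_dequant(pack_4bit(w_int), scales, zeros)
assert torch.allclose(w, scales * (w_int.float() - zeros))
```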

According to this [paper](https://arxiv.org/pdf/2212.09720.pdf), 3-bit or 2-bit quantization is not a very good idea. Also, unlike GPTQ, RTN (round-to-nearest) can perform poorly on LLMs.
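For context, RTN here means plain round-to-nearest quantization with no error compensation. A minimal per-output-channel sketch (my own illustration, not code from the paper):

```
import torch

def rtn_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Asymmetric round-to-nearest, one scale/zero per output row.
    Each weight is rounded independently; nothing compensates for the error."""
    qmax = 2 ** bits - 1
    wmin = w.min(dim=1, keepdim=True).values.clamp(max=0)
    wmax = w.max(dim=1, keepdim=True).values.clamp(min=0)
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)
    return scale * (q - zero)  # dequantized weights

w = torch.randn(128, 512)
w_rtn = rtn_quantize(w, bits=4)
print((w - w_rtn).pow(2).mean())  # reconstruction error grows quickly below 4 bits
```

GPTQ differs in that it updates the remaining weights after each column is rounded, which is exactly what this naive version skips.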

> That does not consider grouping/binning - which is already being used now (with QK=32 --- bin of size 32) https://arxiv.org/pdf/2206.09557.pdf, https://arxiv.org/pdf/2210.17323.pdf and [GPTQ](https://arxiv.org/abs/2210.17323) (table 4 - last row, table...
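To illustrate the grouping/binning mentioned in that quote: instead of one scale per output channel, the input dimension is split into bins of QK = 32 weights and each bin gets its own scale and zero point. A rough round-to-nearest sketch with a group_size parameter (my own illustration, not the kernel's actual layout):

```
import torch

def rtn_quantize_grouped(w: torch.Tensor, bits: int = 4, group_size: int = 32) -> torch.Tensor:
    """Round-to-nearest with a separate scale/zero for every `group_size`
    consecutive input weights (the bin of QK=32 mentioned above)."""
    out_dim, in_dim = w.shape
    assert in_dim % group_size == 0
    qmax = 2 ** bits - 1
    g = w.reshape(out_dim, in_dim // group_size, group_size)
    wmin = g.min(dim=-1, keepdim=True).values.clamp(max=0)
    wmax = g.max(dim=-1, keepdim=True).values.clamp(min=0)
    scale = (wmax - wmin).clamp(min=1e-8) / qmax
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(g / scale) + zero, 0, qmax)
    return (scale * (q - zero)).reshape(out_dim, in_dim)

w = torch.randn(128, 512)
err_per_channel = (w - rtn_quantize_grouped(w, group_size=512)).pow(2).mean()
err_grouped = (w - rtn_quantize_grouped(w, group_size=32)).pow(2).mean()
print(err_per_channel, err_grouped)  # smaller bins usually give lower error
```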

Remove this line https://github.com/AutoGPTQ/AutoGPTQ/blob/main/auto_gptq/nn_modules/qlinear/qlinear_tritonv2.py#L146 and this line https://github.com/AutoGPTQ/AutoGPTQ/blob/main/auto_gptq/nn_modules/triton_utils/dequant.py#L79. I think this will probably work.

If sym=False, you should force it to be saved in the v2 format; it won't work in v1. https://github.com/AutoGPTQ/AutoGPTQ/pull/559/files#diff-f4f987d3fa40acd4e45fc6fcd85d17228a61466910d12dfec21c918674979bacR105
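As I understand the two formats (this sketch reflects my reading of them, so treat the details as assumptions): v1 stores each packed zero point as zero - 1 and the kernels add the 1 back at dequant time, while v2 stores the zero point directly. With sym=True the zero point is fixed, so v1 is fine; with sym=False a zero point of 0 would have to be stored as -1, which the unsigned 4-bit field can't hold:

```
BITS = 4

def store_v1(zero: int) -> int:
    stored = zero - 1  # what the v1 format writes into the packed uint4 field
    assert 0 <= stored < 2 ** BITS, f"{stored} does not fit in an unsigned {BITS}-bit field"
    return stored

def load_v1(stored: int) -> int:
    return stored + 1  # what the v1 kernels add back at dequant time

# sym=True: the zero point is fixed at 2 ** (BITS - 1) = 8, so v1 round-trips fine.
assert load_v1(store_v1(8)) == 8

# sym=False: zero points can be anywhere in [0, 15]; zero == 0 would have to be
# stored as -1, which an unsigned 4-bit field cannot hold.
store_v1(0)  # raises AssertionError
```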

I can't validate this idea at the moment because Hugging Face is returning a 504. It's easy to validate: check whether the weights of qzeros in the v2 model and the...
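Once the download works, the check could look roughly like this (the paths and tensor key below are placeholders, and I'm assuming safetensors checkpoints; adjust to whatever layer actually exists in the two models):

```
from safetensors.torch import load_file

# Placeholder paths to a v1-format and a v2-format checkpoint of the same model.
v1 = load_file("opt-125m-4bit-v1/model.safetensors")
v2 = load_file("opt-125m-4bit-v2/model.safetensors")

# Pick any quantized layer's qzeros tensor; this key is just an example.
key = "model.decoder.layers.0.self_attn.q_proj.qzeros"
print(v1[key].flatten()[:8])
print(v2[key].flatten()[:8])
# If the only difference between the formats is the packed "zero - 1" offset,
# the two tensors should differ by a fixed pattern in every 4-bit field
# rather than being unrelated values.
```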

Test code:

```
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
    level=logging.INFO,
    datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"
...
```

I benchmarked on OPT-2.7B, but it was not as slow as this.

No, I don't think I've broken inference. I've confirmed that it works at least with group size 128.