gpt-fast
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
I am trying to quantize llama-2-7b-chat-hf with gpt-fast using `python quantize.py --mode int4 --groupsize 32` on Kaggle (2x T4 GPUs). I have installed the PyTorch nightly using...
Everything works on my A6000s and A100s, but not on the older V100 (it says the compute capability is too low). Are there plans to add support for legacy devices? Thanks!
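For anyone hitting the same error, a minimal sketch for checking what your devices report; the 8.0 threshold below is an assumption (the int4/bf16 kernels generally expect Ampere-class hardware, while the V100 is 7.0), not a documented guarantee of gpt-fast:

```py
import torch

# Print the compute capability of each visible GPU so you can tell
# whether it meets the kernel's minimum requirement.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    print(f"cuda:{i} {name} -> compute capability {major}.{minor}")

# Hypothetical guard: assumed threshold of 8.0; the exact requirement
# may differ per kernel.
if torch.cuda.get_device_capability(0) < (8, 0):
    print("Pre-Ampere device: the int4 path will likely refuse to run; "
          "int8 or fp16 modes may be the fallback.")
```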
I'm attempting to deploy llama2-7b-chat-hf on a server equipped with two V100 GPUs linked by NVLink, but I've encountered an issue where the tokens per second (token/s) performance is worse...
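A quick diagnostic sketch (not part of gpt-fast itself) that is worth running first: confirm PyTorch actually sees peer-to-peer access between the two cards, since falling back to host copies for the tensor-parallel all-reduce would explain lower tokens/s.

```py
import torch

# Check whether each GPU pair can use direct peer-to-peer (P2P) access.
# If this prints False, the all-reduce in tensor-parallel generation is
# likely going through host memory instead of NVLink.
n = torch.cuda.device_count()
for src in range(n):
    for dst in range(n):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"P2P cuda:{src} -> cuda:{dst}: {ok}")
```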
Thanks for the amazing work! It really is super fast at bs=1. Can batched use cases, or dynamic batching, be supported?
I run CodeLlama 7B: with FP16 the achieved bandwidth is 700 GB/s, but with INT8 it is only 197 GB/s. I run the model on one AMD MI210...
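For context on what those numbers imply: gpt-fast reports "bandwidth achieved" as roughly model size in bytes times tokens/s, so a back-of-the-envelope check (parameter count and byte sizes below are approximations, not measured values) suggests the int8 path is actually producing fewer tokens/s than fp16:

```py
# Rough arithmetic, assuming bandwidth = (model bytes) * (tokens/s).
params = 6.7e9                      # ~CodeLlama-7B parameters (approximate)

fp16_bytes = params * 2             # ~13.4 GB of weights at fp16
int8_bytes = params * 1             # ~6.7 GB of weights at int8

fp16_tok_s = 700e9 / fp16_bytes     # ~52 tokens/s implied by 700 GB/s
int8_tok_s = 197e9 / int8_bytes     # ~29 tokens/s implied by 197 GB/s

print(f"fp16: ~{fp16_tok_s:.0f} tok/s, int8: ~{int8_tok_s:.0f} tok/s")
# So int8 is not just reporting a lower bandwidth number: it is emitting
# fewer tokens/s, which points at the int8 matmul kernels on MI210 being
# the bottleneck rather than memory traffic.
```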
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): __->__ #83, #104. Summary: this makes GPTQ work.
I am using AMD MI210s. After loading the models, the following steps are extremely slow (see screenshot). It turned out that the compilation time is 270 seconds. Could you please help...
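A minimal sketch for separating compile time from steady-state time, using a hypothetical stand-in module rather than the real model; the inductor cache flag at the end is version-dependent, so treat it as an assumption about your nightly:

```py
import time
import torch

# Hypothetical tiny module standing in for the real model, just to show
# how to time the first compiled call separately from warm calls.
model = torch.nn.Linear(4096, 4096).cuda()
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(1, 4096, device="cuda")

t0 = time.time()
compiled(x)                      # first call: triggers compilation
torch.cuda.synchronize()
print(f"first call (includes compile): {time.time() - t0:.1f}s")

t0 = time.time()
for _ in range(10):
    compiled(x)                  # later calls: compiled kernels only
torch.cuda.synchronize()
print(f"10 warm calls: {time.time() - t0:.3f}s")

# Optional and version-dependent: cache inductor artifacts across runs.
# torch._inductor.config.fx_graph_cache = True
```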
Hi, thanks for building this wonderful open-source project! I am using GPTQ to first quantize a llama2-7b-chat-hf model:
```bash
python quantize.py --checkpoint_path checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth --mode int4-gptq --calibration_tasks wikitext --calibration_seq_length 2048
```
...
Given a model compiled with:
```py
model = torch.compile(model, mode="reduce-overhead", fullgraph=True, dynamic=True)
```
where the bulk of the task is computing next-token logits for different prompts (MMLU), memory usage...
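One workaround sometimes suggested for this, sketched under the assumption that the memory growth comes from new compiled/captured variants per prompt length (especially with reduce-overhead's CUDA graphs): pad prompts to a small set of fixed bucket lengths so the compiler only ever sees a handful of shapes. The helper and bucket sizes below are hypothetical.

```py
import torch
import torch.nn.functional as F

# Hypothetical helper: pad token ids up to the nearest bucket length so
# torch.compile only sees a few distinct input shapes instead of a new
# shape (and potentially a new captured graph) per MMLU prompt.
# The extra pad positions still need to be masked out in attention.
BUCKETS = (256, 512, 1024, 2048)

def pad_to_bucket(tokens: torch.Tensor, pad_id: int) -> torch.Tensor:
    length = tokens.shape[-1]
    target = next((b for b in BUCKETS if b >= length), length)
    return F.pad(tokens, (0, target - length), value=pad_id)

# With bucketed inputs, dynamic=True can often be dropped:
# model = torch.compile(model, mode="reduce-overhead", fullgraph=True)
```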
Many papers have recently addressed the quantization of activations for LLMs. Examples: https://github.com/ziplab/QLLM?tab=readme-ov-file#%F0%9F%9B%A0-install, https://github.com/mit-han-lab/lmquant?tab=readme-ov-file#efficiency-benchmarks, https://github.com/spcl/QuaRot. Is it possible to add activation quantization support to gpt-fast for even more...