KIVI
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
As you can see, the top shows the result with KIVI 2-bit applied and the bottom shows the 16-bit result. With KIVI, token generation is reduced by a quarter.
According to Table 7 in the paper, LLaMA, Falcon, and Mistral are all tested, but https://github.com/jy-yuan/KIVI/tree/main/models contains only [llama_kivi.py](https://github.com/jy-yuan/KIVI/blob/main/models/llama_kivi.py) and [mistral_kivi.py](https://github.com/jy-yuan/KIVI/blob/main/models/mistral_kivi.py) and no falcon.py. How can I get...
I found that the command line "cd quant && pip install -e ." does not work with ROCm because of cpp_extention.cpp. May I open a pull request to...
Great work! What's your suggestion if I would like to test it on ChatGLM3?
I ran into an error when I tried KIVI; here is the code (I modified example.py in order to run it on my server): `# LLaMA model with KIVI...`
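For reference, a minimal sketch of the kind of setup example.py uses. The class name `LlamaForCausalLM_KIVI` comes from models/llama_kivi.py; the config fields and their values (`k_bits`, `v_bits`, `group_size`, `residual_length`) follow the repo's README but are assumptions here, not a verified copy of the script:

```python
import torch
from transformers import LlamaConfig, AutoTokenizer
from models.llama_kivi import LlamaForCausalLM_KIVI  # assumed class name from models/llama_kivi.py

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint

config = LlamaConfig.from_pretrained(model_name)
config.k_bits = 2            # quantize keys to 2 bits
config.v_bits = 2            # quantize values to 2 bits
config.group_size = 32       # quantization group size (assumed default)
config.residual_length = 32  # recent tokens kept in full precision (assumed default)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM_KIVI.from_pretrained(
    model_name,
    config=config,
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("Hello, KIVI!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```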
I don't understand why the input data is `value_states_full[:, :, :1, :].contiguous()` instead of `value_states_full[:, :, :-1, :].transpose(2, 3).contiguous()`.
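For readers following this question, the two slices select very different tensors; a small shape check, assuming the usual (batch, num_heads, seq_len, head_dim) cache layout:

```python
import torch

# Assumed KV-cache layout: (batch, num_heads, seq_len, head_dim)
batch, heads, seq_len, head_dim = 1, 32, 5, 128
value_states_full = torch.randn(batch, heads, seq_len, head_dim)

# Keeps only the first position on the sequence axis
first_only = value_states_full[:, :, :1, :].contiguous()

# Drops the last position and swaps the last two axes
all_but_last_t = value_states_full[:, :, :-1, :].transpose(2, 3).contiguous()

print(first_only.shape)      # torch.Size([1, 32, 1, 128])
print(all_but_last_t.shape)  # torch.Size([1, 32, 128, 4])
```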
When I ran the LongBench test with `batch_size=1`, I got the same results as in Table 4 of the paper. However, when I increased the batch size, the results were...
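Whether it explains the discrepancy above is only a guess, but one well-known source of batch-size-dependent differences in Hugging Face generation (independent of KIVI) is padding; a hedged sketch of left-padded batched generation with a LLaMA-style tokenizer:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token   # LLaMA tokenizers ship without a pad token
tokenizer.padding_side = "left"             # decoder-only models should be left-padded for batching

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()

prompts = ["Short prompt.", "A considerably longer prompt that forces padding in the batch."]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
out = model.generate(**batch, max_new_tokens=32, do_sample=False)

# Strip the (left-padded) prompt portion before decoding
print(tokenizer.batch_decode(out[:, batch["input_ids"].shape[1]:], skip_special_tokens=True))
```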
I ran example.py with llama2-7B-hf, with an input length of 4096 tokens and an output length of 100 tokens, and config.k_bits = 2, config.v_bits = 2. The KV cache occupies 5.6 GB of memory, saving only about 500 MB compared...
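For context, a quick back-of-the-envelope estimate of the expected KV-cache footprint; the model shape numbers below are assumptions for a 7B LLaMA-2-style model, and the real savings also depend on KIVI's group size, residual full-precision window, and quantization metadata:

```python
# Assumed shapes for a LLaMA-2-7B-style model: 32 layers, 32 heads, head_dim 128.
layers, heads, head_dim = 32, 32, 128
seq_len = 4096 + 100          # prompt + generated tokens
batch = 1

elems = 2 * batch * layers * heads * head_dim * seq_len   # 2 = keys + values
fp16_gib = elems * 2 / 1024**3        # 2 bytes per element
ideal_2bit_gib = elems / 4 / 1024**3  # 2 bits = 0.25 byte per element (ignoring overheads)

print(f"fp16 KV cache ~ {fp16_gib:.2f} GiB, ideal 2-bit ~ {ideal_2bit_gib:.2f} GiB")
# Roughly 2.0 GiB in fp16 and 0.26 GiB at 2 bits for batch = 1, so a 5.6 GB reading
# may include more than just the quantized cache (activations, the residual
# full-precision window, or total allocated GPU memory).
```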
Hello, I ran the code provided for LongBench using the Llama-3-8B-Instruct model but couldn't reproduce the results reported in Table 8 of your paper. Specifically, the full precision baseline model's...
I ran mem_spd_test.py and got the following error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! I did not...
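That error typically appears when the model is sharded across several GPUs (for example with device_map="auto") while inputs or cache tensors sit on a different device; a hedged sketch of forcing single-GPU placement (an illustrative workaround, not the repo's own script):

```python
import os
# Restrict the process to one GPU before CUDA is initialized (assumed workaround);
# the alternative below is an explicit single-device map.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder model id
    torch_dtype=torch.float16,
    device_map={"": 0},           # put every module on cuda:0 instead of sharding
)
device = next(model.parameters()).device  # move inputs to this same device before generate()
```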