
GPTQ quantization for MPT-30 models

Open ankit201 opened this issue 1 year ago • 1 comments

Feature request

I would like to raise a feature request for GPTQ quantisation of MPT-30B models.

Motivation

MPT-30B models with larger context lengths take up a huge amount of VRAM. Could we have GPTQ quantisation for this model, so that inference time is lower and memory consumption is reduced as well?

Your contribution

I am a bit unsure about how quantisation works across different models. As per the latest version, I saw that GPTQ support for falcon-40b has been added. I can contribute by raising a PR, or we can collaborate and work on it together. Thanks and cheers.

ankit201 avatar Jul 05 '23 10:07 ankit201

Hey, GPTQ should work mostly out of the box for MPT.

You just need to run the quantization script (this should work, up to the potential naming of the layers inside the model).

Then you just need to write the proper loading logic in load_col. This is where it might get tricky. Basically, the weights are stored as stacked [Q, K, V] blocks instead of interleaved per head as [(q_head, k_head, v_head), (q_head, k_head, v_head), ...].

This makes sharding a bit trickier than usual, and it gets even hairier for GPTQ. But it really comes down to indexing the right slices of the tensors using get_slice.
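To illustrate the point above, here is a minimal sketch of how a fused, block-stacked [Q; K; V] weight could be sliced for tensor parallelism. The function name `shard_fused_qkv` and all parameters are hypothetical (not from the TGI codebase); it only demonstrates the indexing idea, assuming the weight's first dimension is `3 * num_heads * head_dim` with Q, K, and V stored as three contiguous blocks.

```python
import numpy as np

def shard_fused_qkv(weight, rank, world_size, num_heads, head_dim):
    """Slice a fused [Q; K; V] weight for one tensor-parallel rank.

    Hypothetical sketch: assumes weight.shape[0] == 3 * num_heads * head_dim,
    with Q, K, V stored as three stacked blocks (not interleaved per head).
    """
    hidden = num_heads * head_dim
    assert weight.shape[0] == 3 * hidden
    assert num_heads % world_size == 0
    heads_per_rank = num_heads // world_size
    start = rank * heads_per_rank * head_dim
    stop = start + heads_per_rank * head_dim
    # Take the matching head range from each of the Q, K, V blocks,
    # then restack so this rank holds [q_shard; k_shard; v_shard].
    q = weight[start:stop]
    k = weight[hidden + start : hidden + stop]
    v = weight[2 * hidden + start : 2 * hidden + stop]
    return np.concatenate([q, k, v], axis=0)
```

The key difference from the usual interleaved layout is that each rank's rows come from three disjoint regions of the tensor rather than one contiguous span, which is why a lazy `get_slice`-style accessor (and per-group handling for GPTQ's quantized tensors) is needed instead of a single contiguous read.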

Narsil avatar Jul 06 '23 16:07 Narsil

@Narsil are you planning to roll out a GPTQ implementation for MPT-30B? The model has good support for 8K input tokens. The current implementation also has memory fragmentation issues; for flash causal LM it is already resolved.

MohnishJain avatar Jul 07 '23 10:07 MohnishJain

> are you planning to roll out a GPTQ implementation for MPT-30B

No, but if you figure out the sharding logic, we are accepting PRs. I tried to provide initial guidance in my previous comment.

Narsil avatar Jul 07 '23 11:07 Narsil

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar May 08 '24 01:05 github-actions[bot]