
How to use 8-bit quantization inference with grok

Open eshaojun opened this issue 1 year ago • 5 comments

I want to run inference on 8×A6000 (50G) GPUs. How can I use 8-bit quantization?

eshaojun avatar Mar 22 '24 02:03 eshaojun

A PyTorch conversion from JAX will likely make this model more accessible to 8-bit quantization, and I think that has already been done: see https://huggingface.co/hpcai-tech/grok-1 and https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/grok-1 for example. From what I can see, ColossalAI's conversion is just the base model with the open weights, no quants, and I can't say how faithful the conversion is. Still, it might open things up for you to try something like bitsandbytes, or another quantization library, and see whether it works. A sketch of that is below.
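
If the PyTorch conversion loads through Hugging Face transformers, something like the sketch below might be a starting point for 8-bit with bitsandbytes. This is untested on grok-1; the model id, the trust_remote_code behavior, and whether the 8-bit weights plus activations actually fit across your 8 cards are all assumptions on my part.

```python
# Untested sketch: assumes hpcai-tech/grok-1 loads via transformers with
# trust_remote_code, and that the 8-bit weights can be sharded across all
# visible GPUs with device_map="auto".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "hpcai-tech/grok-1"

# load_in_8bit uses bitsandbytes' LLM.int8() scheme for the linear layers
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",          # shard layers across the available GPUs
    torch_dtype=torch.float16,  # non-quantized tensors stay in fp16
    trust_remote_code=True,
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```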

I don't know whether quantization in JAX is a thing; I'm not highly familiar with JAX.

davidearlyoung avatar Mar 23 '24 22:03 davidearlyoung

Found this:

  • https://github.com/xai-org/grok-1/issues/202 — the 'keyfan' and 'LagPixelLOL' links from that issue may have related info of interest for you. I saw something about 8-bit support in LagPixelLOL's GitHub link from the same issue.

davidearlyoung avatar Mar 23 '24 23:03 davidearlyoung

Those models say "For now, a 8x80G multi-GPU machine is required", so I fail to see what the benefit is.

Sequential-circuits avatar Mar 25 '24 11:03 Sequential-circuits

Quantization, when done right, can cut disk and memory usage considerably, which moves inference into cheaper and more accessible territory. On the downside, quantization can add some overhead at inference time and some loss of model accuracy. For many models, well-done quants have shown accuracy loss that is barely noticeable for most tasks, and the added latency in token generation stays within reason.

So far the most practical option for grok-1 is likely IQ3 quantization, which puts disk and memory usage at roughly 120+ GB. That is a lot more accessible than the roughly 300 GB needed for the original open-weight release.
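
Rough back-of-envelope on the weight sizes, assuming grok-1's ~314B parameters (this ignores KV cache, activations, and per-tensor quantization overhead, so treat the numbers as ballpark only):

```python
# Ballpark weight-only sizes for grok-1 at different bit widths.
PARAMS = 314e9  # approximate grok-1 parameter count

for name, bits_per_weight in [("fp16", 16), ("int8", 8), ("IQ3 (~3 bpw)", 3.1)]:
    gb = PARAMS * bits_per_weight / 8 / 1e9
    print(f"{name:14s} ~{gb:,.0f} GB")

# fp16           ~628 GB
# int8           ~314 GB
# IQ3 (~3 bpw)   ~122 GB
```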

davidearlyoung avatar Mar 31 '24 19:03 davidearlyoung

@eshaojun If you haven't already, you may want to look into ggerganov's llama.cpp repo on GitHub and its accompanying GGUF quantization format; it may prove useful for what you're after. There is grok-1 discussion on the llama.cpp discussions, pull request, and issue boards regarding grok-1 quantization support. Historically, llama.cpp has been one of the most versatile and capable LLM quantization systems, with one of the most active public quant communities I've seen so far. You can use the GGUF format and llama.cpp tooling to convert model weights to different quantized data types, such as 8-bit, if that's what you want to do.
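
If someone has already produced a GGUF quant of grok-1 (or you build one yourself with llama.cpp's convert and quantize tools), running it from Python via llama-cpp-python looks roughly like the sketch below. The file name is a placeholder, and the offload settings will depend entirely on your hardware.

```python
# Sketch, assuming you already have a grok-1 GGUF file on disk (e.g. a Q8_0
# or IQ3 quant). The path below is a hypothetical placeholder.
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="./grok-1-Q8_0.gguf",  # placeholder file name
    n_gpu_layers=-1,  # offload as many layers as will fit on the GPUs
    n_ctx=4096,       # context window; adjust to your memory budget
)

out = llm("Explain 8-bit quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```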

davidearlyoung avatar Mar 31 '24 19:03 davidearlyoung