Quantization with less loss via expert offloading? Can we imitate Mixtral-offloading?
Previous issues suggest that we can compile the model for portable use on a local GPU, but only with heavy quantization that sacrifices performance.
One example is comparing the 4-bit GGUF of Mixtral 8x7B from TheBloke on Hugging Face (https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF) with some online Mixtral deployments. I suspect this MoE can likewise get lossy when it is simplified too much.
However, for the Mixtral MoE, https://github.com/dvmazur/mixtral-offloading gives a less harsh quantization, which I used to build some apps with better results, at least in my case. Of course, bitsandbytes needs to be compiled with the proper backend to run efficiently on CUDA (see https://github.com/TimDettmers/bitsandbytes/issues/112), together with proper settings for the Accelerate package (see: `accelerate config`).
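For reference, this is the generic route I mean when combining bitsandbytes 4-bit loading with Accelerate's device placement through transformers (a minimal sketch, not the mixtral-offloading API itself; the checkpoint name and memory limits are just examples to adjust):

```python
# Minimal sketch: 4-bit bitsandbytes loading with Accelerate splitting layers
# across GPU and CPU. Checkpoint and memory limits below are examples only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-v0.1"  # example checkpoint, not a Grok-1 port

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,  # allow layers spilled to CPU to stay in fp32
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                        # let Accelerate decide GPU/CPU placement
    max_memory={0: "22GiB", "cpu": "64GiB"},  # example limits, adjust to your hardware
)

prompt = "Mixture-of-experts models route each token to"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```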
Since Grok-1 is also an MoE, can we build something similar?
Issue #156 says there is already quantization code. One efficient path I know of is GGUF with llama.cpp (https://github.com/ggerganov/llama.cpp) or candle. If we build a GGUF and compile llama.cpp with the right settings (look at its compilation options), that should give a usable local build.
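To illustrate the GGUF route, here is a minimal sketch with the llama-cpp-python bindings, assuming someone has already produced a quantized GGUF of the model (the file path below is hypothetical); llama.cpp's `n_gpu_layers` already gives a coarse CPU/GPU split:

```python
# Minimal sketch: load a (hypothetical) quantized GGUF and keep only part of
# the layers on the GPU, letting the rest run on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="./grok-1-q4_k_m.gguf",  # hypothetical quantized file, not an existing artifact
    n_gpu_layers=20,                    # number of layers to keep on the GPU; tune to your VRAM
    n_ctx=4096,
)

out = llm("Explain mixture-of-experts routing in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```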
Issue #15 asked for 4-bit quantization and gave an estimate of the resulting size; #42 commented on the immense original size.
Here I do not know whether it is easier to implement the offloading via llama.cpp, or whether we need to break into the model class objects and move things (the experts, between CPU and GPU) using torch in Python or using candle. Another candidate is ONNX (https://onnx.ai/), but I do not know whether the compilation causes problems if the model uses non-standard class or structural definitions. Note that even though the Mixtral MoE offloading works on Colab, it does not work on some GPUs with older architectures (it works on my 4090s but not on an M6000 24GB).
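To make the "move the experts between CPU and GPU" idea concrete, here is a rough torch sketch of per-expert offloading (illustrative only; the module names, sizes, and routing here are my own toy stand-ins and do not match Grok-1's actual class definitions):

```python
# Toy sketch of expert offloading: experts live on the CPU, and only the ones
# selected by the router are moved to the GPU for the forward pass.
import torch
import torch.nn as nn

class OffloadedMoE(nn.Module):
    def __init__(self, hidden_size=512, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts)
        # Experts are created (and kept) on the CPU.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, 4 * hidden_size),
                          nn.GELU(),
                          nn.Linear(4 * hidden_size, hidden_size))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, hidden) on the GPU
        scores = self.router(x)
        weights, idx = torch.topk(scores.softmax(dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e in idx.unique().tolist():
            token_mask, slot = (idx == e).nonzero(as_tuple=True)
            expert = self.experts[e].to(x.device)   # move this expert's weights in
            out[token_mask] += weights[token_mask, slot].unsqueeze(-1) * expert(x[token_mask])
            self.experts[e].to("cpu")               # and back out again
        return out

moe = OffloadedMoE()
moe.router.to("cuda")  # the router stays on the GPU; experts stay on the CPU
tokens = torch.randn(16, 512, device="cuda")
print(moe(tokens).shape)  # torch.Size([16, 512])
```

The obvious cost is the PCIe transfer per selected expert, which is what mixtral-offloading mitigates with caching and quantized expert weights.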
On the Hugging Face transformers side, Grok can be loaded and it seems the class objects are (at least partially) written: https://github.com/huggingface/transformers/issues/29704
There is also the CPU inference pull request by louiehelm: https://github.com/xai-org/grok-1/pull/235
I am no expert on moving weights between CPU and GPU, so I do not know what the next step would be or whether this is even feasible. If anyone has insights or implementations, that would be nice.
Some updates:
While we use quantization and offloading in the meantime, it may also be feasible to distill the model.
A recent paper estimates the size of the currently deployed GPT model (surprisingly, not that immense); I need to find the link and post it here.
On distillation, https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs looks like a good resource (I have looked at other sources as well; many company talks seem to reveal very little of what is actually done).
Still, I have neither the compute nor the coding ability to compress the model in these ways myself.
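For completeness, the basic distillation objective looks like this (a toy sketch of Hinton-style soft-label KL with random stand-in logits, not any specific Grok recipe):

```python
# Toy sketch of the knowledge-distillation loss: match the student's softened
# output distribution to the teacher's with KL divergence, scaled by T^2.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    t = temperature
    s = F.log_softmax(student_logits / t, dim=-1)   # student log-probs (softened)
    p = F.softmax(teacher_logits / t, dim=-1)       # teacher probs (softened)
    return F.kl_div(s, p, reduction="batchmean") * (t * t)

# toy usage with random logits over a vocabulary of 100 tokens
student = torch.randn(4, 100)
teacher = torch.randn(4, 100)
print(distillation_loss(student, teacher))
```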