[Feature] Layer-Wise Calibration and Quantization of Models (to quantize a model on a low-VRAM GPU)
Motivation
As models get larger, quantizing them currently requires a GPU with enough VRAM to hold the entire model, even though the quantized model can then run on a low-VRAM GPU, so two different tiers of hardware are needed. If we could instead take the separate .bin shard files and quantize or calibrate each layer individually, quantization could be done on low-VRAM GPUs as well.
Related resources
No response
Additional context
No response
The current lmdeploy quantization is actually layer-wise.
Actually, what I am proposing is this: instead of loading the whole model at once and then quantizing it layer by layer (which creates overhead on the CPU), we could load and quantize one layer (or shard) at a time, which needs far less CPU memory and can still convert all of the .bin files. A rough sketch of the idea is below.
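For illustration only, here is a minimal sketch of the shard-at-a-time approach, not lmdeploy's actual implementation: each checkpoint shard is loaded to CPU, its 2-D weight tensors are moved to the GPU and quantized with a simple per-channel absmax int8 scheme, and the result is saved before the next shard is touched, so peak VRAM stays around one shard. The file pattern, function names, and the absmax scheme are assumptions for the sketch; a real calibration pass (e.g. AWQ) would also need per-layer activation statistics, which this omits.

```python
import glob
import torch

def quantize_tensor_int8(weight: torch.Tensor):
    """Symmetric per-output-channel int8 quantization (absmax)."""
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def quantize_shards(shard_pattern: str, device: str = "cuda"):
    # Process each checkpoint shard independently so that only one shard
    # (roughly one group of layers) is resident in GPU memory at a time.
    for shard_path in sorted(glob.glob(shard_pattern)):
        state_dict = torch.load(shard_path, map_location="cpu")
        quantized = {}
        for name, tensor in state_dict.items():
            if tensor.dim() == 2:  # quantize linear-layer weights only
                w = tensor.to(device)
                q, scale = quantize_tensor_int8(w)
                quantized[name] = q.cpu()
                quantized[name + ".scale"] = scale.cpu()
                del w, q, scale
            else:
                quantized[name] = tensor  # keep biases/norms in full precision
        torch.save(quantized, shard_path.replace(".bin", ".int8.bin"))
        del state_dict, quantized
        torch.cuda.empty_cache()  # release VRAM before the next shard

if __name__ == "__main__":
    # Hypothetical shard naming; adjust to the checkpoint layout in use.
    quantize_shards("pytorch_model-*.bin")
```

The key point is that GPU memory is bounded by the largest single shard rather than the whole model, at the cost of one load/quantize/save round trip per shard.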
I assume the limitation for most users comes mainly from GPU VRAM rather than CPU memory, so we will not implement this in the short term. But you are welcome to open a merge request to support this feature if you are willing to.
Sure, I will do that.
Any updates? The issue is going to be closed. Feel free to reopen it if it is still an issue.