[Feature] Layer-Wise Calibration and Quantization of Models (to quantize a model on a low-VRAM GPU)
Motivation
As models get larger, quantizing them currently requires a GPU with enough VRAM to hold the entire model, even though the quantized model can then run on a low-VRAM GPU, so two different tiers of hardware are needed. If we could instead take the separate .bin shard files and quantize or calibrate each layer individually, quantization could be done on low-VRAM GPUs as well.
Related resources
No response
Additional context
No response
The current lmdeploy quantization is actually layer-wise.
Actually, what I am proposing is this: instead of loading the whole model at once and then quantizing it layer by layer (which creates overhead on the CPU), we could load and quantize one layer (or shard) at a time, which needs far less CPU memory and can still convert all of the .bin files. A rough sketch of the idea is below.
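For illustration only, here is a minimal sketch of the shard-at-a-time approach, not lmdeploy's actual implementation: each checkpoint shard is loaded to CPU, its 2-D weight tensors are moved to the GPU and quantized with a simple per-channel absmax int8 scheme, and the result is saved before the next shard is touched, so peak VRAM stays around one shard. The file pattern, function names, and the absmax scheme are assumptions for the sketch; a real calibration pass (e.g. AWQ) would also need per-layer activation statistics, which this omits.

```python
import glob
import torch

def quantize_tensor_int8(weight: torch.Tensor):
    """Symmetric per-output-channel int8 quantization (absmax)."""
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def quantize_shards(shard_pattern: str, device: str = "cuda"):
    # Process each checkpoint shard independently so that only one shard
    # (roughly one group of layers) is resident in GPU memory at a time.
    for shard_path in sorted(glob.glob(shard_pattern)):
        state_dict = torch.load(shard_path, map_location="cpu")
        quantized = {}
        for name, tensor in state_dict.items():
            if tensor.dim() == 2:  # quantize linear-layer weights only
                w = tensor.to(device)
                q, scale = quantize_tensor_int8(w)
                quantized[name] = q.cpu()
                quantized[name + ".scale"] = scale.cpu()
                del w, q, scale
            else:
                quantized[name] = tensor  # keep biases/norms in full precision
        torch.save(quantized, shard_path.replace(".bin", ".int8.bin"))
        del state_dict, quantized
        torch.cuda.empty_cache()  # release VRAM before the next shard

if __name__ == "__main__":
    # Hypothetical shard naming; adjust to the checkpoint layout in use.
    quantize_shards("pytorch_model-*.bin")
```

The key point is that GPU memory is bounded by the largest single shard rather than the whole model, at the cost of one load/quantize/save round trip per shard.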
I assume the limitation for most users comes mainly from GPU VRAM rather than CPU memory, so we will not implement this in the short term. But you are welcome to open a merge request to support this feature if you are willing to.
Sure, I will do that.
Any updates? The issue is going to be closed. Feel free to reopen it if it is still an issue.