
[Feature] Layer Wise Calibration and Quantization of Models (To quantize model on Low VRAM GPU)

Tushar-ml opened this issue 1 year ago • 4 comments

Motivation

As models get larger, quantization currently requires a high-VRAM GPU because the whole model must be loaded onto the GPU, even though the quantized model can later fit on a low-VRAM GPU, so two different GPU resources are needed. If we could instead take the separate .bin shard files and quantize or calibrate each layer one at a time, quantization could be done on low-VRAM GPUs as well.
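For illustration, here is a minimal sketch of that shard-by-shard flow. This is not lmdeploy's actual API: the pytorch_model-*.bin file pattern, the simple round-to-nearest int8 quantizer, and the -int8.bin output naming are all assumptions made for the example.

```python
import glob
import torch

def quantize_int8(weight: torch.Tensor):
    """Symmetric per-tensor round-to-nearest int8 quantization (illustrative only)."""
    scale = weight.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

for shard_path in sorted(glob.glob("pytorch_model-*.bin")):
    # Load one shard into CPU memory; only one tensor at a time touches the GPU.
    shard = torch.load(shard_path, map_location="cpu")
    quantized = {}
    for name, tensor in shard.items():
        if tensor.is_floating_point():
            q, scale = quantize_int8(tensor.cuda())  # quantize this weight on the GPU
            quantized[name] = q.cpu()
            quantized[f"{name}.scale"] = scale.cpu()
        else:
            quantized[name] = tensor
    torch.save(quantized, shard_path.replace(".bin", "-int8.bin"))
    del shard, quantized  # release the shard before loading the next one
```

A real calibration-based method such as AWQ would additionally need activation statistics, which in principle could be gathered by streaming calibration inputs through one decoder layer at a time while only that layer resides on the GPU.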

Related resources

No response

Additional context

No response

Tushar-ml avatar May 21 '24 06:05 Tushar-ml

The current lmdeploy quantization is actually already layer-wise.

AllentDan avatar May 21 '24 09:05 AllentDan

What I am actually proposing: instead of loading the whole model at once and then quantizing it layer-wise (which creates CPU memory overhead), load and quantize one layer at a time. That uses far less CPU memory and can still convert all the .bin files.

Tushar-ml avatar May 21 '24 15:05 Tushar-ml

I assume the limitation for users mainly comes from GPU VRAM rather than CPU memory. So we will not implement this in the short term, but you are welcome to open a pull request to support this feature.

AllentDan avatar May 22 '24 02:05 AllentDan

Sure, will do that

Tushar-ml avatar May 22 '24 06:05 Tushar-ml

Any updates? This issue is going to be closed. Feel free to reopen it if it is still a problem.

AllentDan avatar Jun 20 '24 10:06 AllentDan