quanto
Why is the quantized net slower?
batch_size: 1, torch_dtype: fp32, unet_dtype: int8 in 3.754 seconds. Memory: 5.240GB.
batch_size: 1, torch_dtype: fp32, unet_dtype: None in 3.378 seconds. Memory: 6.073GB.
I'm using the example code for stable diffusion, but inference is slower with the quantized int8 version (I've also tested the speed on my own model, and quantization brings higher VRAM usage and slower inference). Why is that the case?
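For reference, a minimal sketch of the kind of benchmark being run (this assumes the optimum-quanto `quantize`/`freeze` API, the SD 1.5 checkpoint, and 20 inference steps; the exact checkpoint and step count are illustrative, not taken from the original report):

```python
import time
import torch
from diffusers import StableDiffusionPipeline
from optimum.quanto import quantize, freeze, qint8  # assuming the optimum-quanto API

# Assumption: SD 1.5 checkpoint; any diffusers pipeline with a UNet works the same way.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float32
).to("cuda")

# Quantize only the UNet weights to int8, then freeze to materialize the quantized weights.
quantize(pipe.unet, weights=qint8)
freeze(pipe.unet)

prompt = "a photo of an astronaut riding a horse"

# Warm-up run so one-time CUDA initialization is not counted in the timing.
pipe(prompt, num_inference_steps=20)

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()
pipe(prompt, num_inference_steps=20)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(
    f"batch_size: 1, torch_dtype: fp32, unet_dtype: int8 in {elapsed:.3f} seconds. "
    f"Memory: {torch.cuda.max_memory_allocated() / 1024**3:.3f}GB."
)
```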
same observation
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
Same observation on an NVIDIA A30.