djl
Support dynamic quantization for PyTorch
Description
Dynamic quantization: weights are quantized ahead of time, while activations are read/stored in floating point and quantized on the fly for compute.
Will this change the current API?
- Model.quantize
Who will benefit from this enhancement?
Today, PyTorch supports the following backends for running quantized operators efficiently:
- x86 CPUs with AVX2 support or higher (without AVX2, some operations have inefficient implementations), via the x86 backend optimized by fbgemm and onednn (see the details in the RFC)
- ARM CPUs (typically found in mobile/embedded devices), via qnnpack
- (early prototype) support for NVIDIA GPUs via TensorRT through fx2trt (to be open sourced)
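For reference, on the PyTorch side this is exposed through the eager-mode quantize_dynamic API, which a DJL Model.quantize method would presumably delegate to under the hood. A minimal sketch (the toy model here is illustrative, not from the source):

```python
import torch

# A small float model; dynamic quantization targets layers such as
# Linear and LSTM, whose weights dominate memory and compute.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 2),
)

# Weights are converted to int8 ahead of time; activations stay in
# floating point and are quantized on the fly at compute time.
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Inference still takes and returns ordinary float tensors.
out = qmodel(torch.randn(1, 4))
print(out.shape)  # torch.Size([1, 2])
```

Because no calibration data is needed, this is the lowest-friction quantization mode to surface through a one-call API like the proposed Model.quantize.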