Remove dynamic quantization option for PyTorch models at upload
Dynamic quantization of PyTorch models has proven to be a challenge for two reasons.
(1) Dynamic quantization ties the traced TorchScript model to a particular CPU architecture and makes it non-portable. For example, a model traced (via the upload CLI) on an ARM-based M-series Apple processor cannot run on an Intel CPU, and vice versa. Tracing this way also means that Intel-specific optimizations cannot be used. The best practice is to trace the model on the same CPU architecture as the target inference processors. GPU support would add further complexity, and eland is currently not capable of tracing on a GPU at all.
(2) "Blind" dynamic quantization at upload time is arguably an anti-pattern. Quantization can damage the accuracy of a model, and quantizing blindly, without evaluating the model afterwards, can produce surprising results at inference.
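The accuracy loss comes from rounding weights to 8-bit integers. A minimal sketch of symmetric int8 quantization, using only the standard library (real dynamic quantization in PyTorch is more involved, but the rounding error shown here is the same source of accuracy loss; the helper names are illustrative):

```python
def quantize_int8(values):
    """Map floats onto integers in [-127, 127] with a shared scale."""
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    """Recover approximate floats from the quantized integers."""
    return [v * scale for v in q]

weights = [0.013, -0.872, 0.005, 0.391, -0.044]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Every weight is off by up to half a quantization step (scale / 2).
# Across millions of weights these errors shift model outputs, which is
# why a quantized model must be re-evaluated before it is trusted.
errors = [abs(a - b) for a, b in zip(weights, restored)]
```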
For these reasons, we believe it is safest to remove dynamic quantization as an option. If users would like to use quantized models, they can do so in PyTorch or transformers directly, and upload their new model with eland's Python methods (as opposed to using the CLI).
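A sketch of the PyTorch side of that workflow, using the public `torch.quantization.quantize_dynamic` API (the tiny `Sequential` model and output path are stand-ins for a real Hugging Face model):

```python
import os
import tempfile

import torch
from torch.quantization import quantize_dynamic

# Stand-in for a real Hugging Face model: any eval-mode nn.Module with
# Linear layers is quantized the same way.
model = torch.nn.Sequential(
    torch.nn.Linear(8, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 2),
)
model.eval()

# Quantize on the same CPU architecture that will serve inference, and
# evaluate the quantized model before uploading it anywhere.
quantized = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

# Trace to TorchScript. The traced artifact embeds the CPU backend's
# packed-weight format, which is exactly what makes it non-portable.
example = torch.randn(1, 8)
traced = torch.jit.trace(quantized, example)

out_path = os.path.join(tempfile.mkdtemp(), "traced_quantized.pt")
torch.jit.save(traced, out_path)

# The saved file can then be uploaded with eland's Python API
# (eland.ml.pytorch.PyTorchModel.import_model) along with the task
# config and vocabulary, instead of going through the CLI.
```

Note that this only succeeds on a machine whose PyTorch build ships a quantization engine, which is the subject of this issue.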
Dynamic quantization is controlled by the --quantize parameter of the eland_import_hub_model script. It has always been considered an advanced option and should now be deprecated. The script should emit a warning describing the hardware incompatibility problem when the option is used.
To understand exactly what happens when quantizing on a different architecture from the one used at evaluation, I used eland_import_hub_model to trace a quantized model on an M1 Mac and upload it to an x86 Linux server for evaluation.
Tracing the model with the --quantize option fails on an M1 Mac with the error:
```
RuntimeError: Didn't find engine for operation quantized::linear_prepack NoQEngine
```
Full stack trace:

```
Traceback (most recent call last):
  File "/usr/local/bin/eland_import_hub_model", line 8, in
```

The models sentence-transformers/msmarco-MiniLM-L-12-v3 and dslim/bert-base-NER were tested:
```sh
docker run -it --rm elastic/eland \
  eland_import_hub_model \
  --cloud-id $CLOUD_ID \
  -u elastic -p $CLOUD_PWD \
  --hub-model-id sentence-transformers/msmarco-MiniLM-L-12-v3 \
  --task-type text_embedding \
  --quantize
```
```sh
docker run -it --rm elastic/eland \
  eland_import_hub_model \
  --cloud-id $CLOUD_ID \
  -u elastic -p $CLOUD_PWD \
  --hub-model-id dslim/bert-base-NER \
  --task-type ner \
  --quantize
```
The 8.9 Docker image, with PyTorch 1.13.1, was used in this test.
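The NoQEngine error means the PyTorch build has no quantized-kernel backend (typically FBGEMM on x86, QNNPACK on ARM) available for the quantized::linear_prepack op. Which backends a given build supports can be inspected via PyTorch's `torch.backends.quantized` module:

```python
import torch

# Backends this PyTorch build can use for quantized kernels; a build
# whose only usable entry is 'none' produces the NoQEngine error above.
print(torch.backends.quantized.supported_engines)

# The engine currently selected for quantized ops.
print(torch.backends.quantized.engine)
```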