Enable AMP (Automatic Mixed Precision) in TensorFlow Serving.
Describe the problem the feature is intended to solve
AMP accelerates inference significantly.
Describe the solution
A flag for enabling AMP
Describe alternatives you've considered
There is no alternative with TensorFlow Serving.
Additional context
N/A
I think this should be very high priority (at the least FP16); otherwise the case for TFS becomes weak.
AMP mainly targets training rather than serving (https://www.tensorflow.org/guide/keras/mixed_precision).
Have you observed a significant performance difference for serving as well? If so, could you share the benchmark and related numbers?
How do I turn on AMP in serving? I have observed a 50% improvement in processing time with FP16 over FP32, without any noticeable change in accuracy. Reduced precision is one of the cornerstones of NVIDIA TensorRT, etc. See this one also - https://medium.com/@whatdhack/neural-network-inference-optimization-8651b95e44ee .
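For comparison, this is roughly what the reduced-precision path looks like with TF-TRT today: convert the SavedModel to FP16 offline, then serve the converted model as usual. A minimal sketch with the TF 1.x converter; the paths are placeholders:

```python
# Sketch: offline FP16 conversion with TF-TRT (TF 1.14+).
# '/models/my_model/1' and '/models/my_model_fp16/1' are placeholder paths.
from tensorflow.python.compiler.tensorrt import trt_convert as trt

converter = trt.TrtGraphConverter(
    input_saved_model_dir='/models/my_model/1',  # original FP32 SavedModel
    precision_mode='FP16')                       # build FP16 TensorRT engines
converter.convert()
converter.save('/models/my_model_fp16/1')        # output is a normal SavedModel
```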
Is there a way to do the following in TFS?

```python
import tensorflow as tf  # TF 1.x API

# Ask grappler to run the auto mixed precision rewrite on the graph.
config = tf.ConfigProto()
config.graph_options.rewrite_options.auto_mixed_precision = 1
sess = tf.Session(config=config)
```
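For what it's worth, in TF 2.x the same grappler rewrite can be toggled globally through the optimizer options API; a minimal sketch:

```python
import tensorflow as tf  # TF 2.x

# Globally enable grappler's auto mixed precision graph rewrite,
# the TF 2.x counterpart of the ConfigProto snippet above.
tf.config.optimizer.set_experimental_options({'auto_mixed_precision': True})
```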
I just ran some tests on a Mask R-CNN SavedModel in nvcr.io/nvidia/tensorflow:20.03-tf1-py3. TF_ENABLE_AUTO_MIXED_PRECISION seems to work very well for inference: it requires less memory and speeds things up significantly. The following are the numbers, if you need more convincing.
TF_ENABLE_AUTO_MIXED_PRECISION=1: memory = 4.2 GB, inference time = 0.25 sec
FP32 baseline: memory = 7.1 GB, inference time = 0.53 sec
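Since that switch is an environment variable rather than a server flag, a possible stopgap is to set it in the serving process's environment (e.g. `docker run -e TF_ENABLE_AUTO_MIXED_PRECISION=1 ...` with the GPU image). Note I have only verified it inside the NVIDIA container above, not with stock TF Serving. The in-process TF 1.x equivalent looks like this:

```python
# Sketch: enable the grappler AMP pass via the environment variable.
# Verified in nvcr.io/nvidia/tensorflow:20.03-tf1-py3 above; whether stock
# TensorFlow / TF Serving honors it is an open question.
import os
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'  # set before graph optimization

import tensorflow as tf
sess = tf.Session()  # grappler reads the variable when rewriting the graph
```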
Thanks for the experiments and numbers! Based on the numbers, we could add the option. I will also follow up with our GPU team.
Any update here? Also, is it possible to enable JIT/XLA as well, like https://github.com/tensorflow/serving/issues/1515?
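For reference, this is the TF 1.x session-level knob that issue is about; a minimal sketch, with the open question being whether TFS could expose it:

```python
import tensorflow as tf  # TF 1.x API

# Turn on XLA JIT compilation for all compilable ops in the session.
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
sess = tf.Session(config=config)
```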
Any update here?
I'd really appreciate this feature being added too.
Hi, any updates here?