model-optimization

Pruning does not reduce inference time.

ectg opened this issue 4 years ago · 3 comments

System information

  • TensorFlow version (you are using): 2.3.0
  • Are you willing to contribute it (Yes/No): No

Motivation: Currently, pruning in tensorflow_model_optimization does not reduce inference time. Even though the pruned model is sparser than the original, inference time remains the same. (This was tested on a ResNet model.)

Describe the feature: Pruning sets weights to zero, but does not prune the network's edges. Update the pruning feature so that the new sparse weights result in a corresponding speedup at inference time. A minimal sketch of the flow in question follows below.
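For context, here is a minimal sketch of the pruning flow being discussed (the model and pruning parameters are placeholders, not from the original report). It illustrates why latency is unchanged: after stripping the pruning wrappers, the zeroed weights are still stored and multiplied as dense tensors.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Placeholder model; the original report used a ResNet.
base_model = tf.keras.applications.ResNet50(weights=None, classes=10)

# Wrap the model so low-magnitude weights are driven to zero during training.
pruning_params = {
    "pruning_schedule": tfmot.sparsity.keras.ConstantSparsity(
        target_sparsity=0.8, begin_step=0
    )
}
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(base_model, **pruning_params)

# ... train with the tfmot.sparsity.keras.UpdatePruningStep() callback ...

# Remove the pruning wrappers. The weights are now ~80% zeros, but plain
# Keras inference still executes dense matmuls/convolutions, so latency
# is the same as the unpruned model.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```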

ectg · Jan 08 '21 20:01

I am currently facing the same problem: I pruned an SSD model and the inference time is unchanged. Pruning only guarantees a smaller model size once the weights are compressed. For this reason I am exploring quantization, which at least reduces CPU and GPU latency and should therefore improve inference time. A rough sketch is shown below.
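For reference, a minimal post-training dynamic-range quantization sketch with the TFLite converter (`model` here is a placeholder Keras model; full integer quantization would additionally require a representative dataset):

```python
import tensorflow as tf

# Convert a trained Keras model with post-training quantization enabled.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

# Save the quantized model for deployment.
with open("model_quant.tflite", "wb") as f:
    f.write(tflite_quant_model)
```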

sachinkmohan · Jan 11 '21 10:01

Using quantization instead is definitely an alternative solution.

You can also check out this blogpost: https://ai.googleblog.com/2021/03/accelerating-neural-networks-on-mobile.html

For CNN models, you can use pruning during training and then deploy with TFLite with the XNNPack delegate enabled. There are certain restrictions on the graph architecture, documented here: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/delegates/xnnpack#sparse-inference. A conversion sketch is shown below.
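As a sketch only (not an official snippet from this thread), conversion of the stripped, pruned Keras model with the experimental sparsity optimization, so the converter stores sparse weights that XNNPack can exploit, subject to the architecture restrictions in the link above:

```python
import tensorflow as tf

# final_model is assumed to be a pruned Keras model after strip_pruning().
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
converter.optimizations = [
    tf.lite.Optimize.DEFAULT,
    tf.lite.Optimize.EXPERIMENTAL_SPARSITY,
]
sparse_tflite_model = converter.convert()

# Recent TFLite Python builds apply the XNNPack delegate automatically;
# benchmark with the interpreter to confirm an actual latency win.
interpreter = tf.lite.Interpreter(model_content=sparse_tflite_model)
interpreter.allocate_tensors()
```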

Please give it a try and let us know how it works. Thanks!

liyunlu0618 · Apr 20 '21 21:04

An alternative way to achieve faster inference with pruning is structured (2:4) pruning: https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_sparsity_2_by_4 (see the sketch below).
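A minimal sketch of 2:4 structured pruning following the guide linked above (`model` is a placeholder; consult the guide for which layers and backends support accelerated 2:4 sparsity):

```python
import tensorflow_model_optimization as tfmot

# Enforce 2-out-of-4 structured sparsity, which supported hardware/runtimes
# can exploit for real speedups (unlike unstructured sparsity).
pruning_params_2_by_4 = {"sparsity_m_by_n": (2, 4)}
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, **pruning_params_2_by_4
)
```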

Dutra-Apex · Jan 10 '24 15:01