
Sparsity Runtime Integration with TF/TFLite for Latency Improvements

Open alanchiao opened this issue 4 years ago • 29 comments

As suggested here, model pruning currently only provides benefits in model compression/size reduction. Further framework support is necessary to provide latency improvements in TF/TFLite.

alanchiao avatar Dec 06 '19 16:12 alanchiao

When do you think this will be included in a TensorFlow / TFLite release? Is there a targeted timeline? Based on that, we are planning to do internal development if this is not expected within this year (2020).

sujoyrc avatar Mar 30 '20 19:03 sujoyrc

Hi. We're expecting a Q2/Q3 release date, though full TFLite kernel support will be an ongoing process after that (i.e. not all TFLite kernels will have sparse execution support).

Also, we're hoping the current working-from-home situation won't affect things further.

Thanks

raziel avatar Mar 30 '20 20:03 raziel

Thank you

sujoyrc avatar Mar 31 '20 11:03 sujoyrc

Why is this closed? Will this be integrated in the next version?

sujoyrc avatar Apr 03 '20 19:04 sujoyrc

Reopened. It will not necessarily be integrated in the next release.

alanchiao avatar Apr 07 '20 18:04 alanchiao

Will sparse models ever result in smaller/compressed *.tflite models? This would be a huge plus for low power use cases as it would reduce I/O.

Currently I'm working with a quantized 14MB model but if pruned & compressed it could go down to 2MB and be able to fit in the SRAM of some MCUs.

shariq-audiofocus avatar Apr 23 '20 04:04 shariq-audiofocus

> Will sparse models ever result in smaller/compressed *.tflite models? This would be a huge plus for low power use cases as it would reduce I/O.
>
> Currently I'm working with a quantized 14MB model but if pruned & compressed it could go down to 2MB and be able to fit in the SRAM of some MCUs.

Same here! Currently my *.tflite model and its sparse counterpart have the same storage requirements.

If TFLite could detect the zeros and change their type to uint8, this would make a huge difference in model size (MBs).

paulaksm avatar Apr 27 '20 17:04 paulaksm

@paulaksm @shariq-audiofocus have you tried structural pruning instead? If you only care about storage, why not consider gzip-compressing the model file (e.g. with Crypto++'s gzip)?
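For what it's worth, here is a minimal sketch of that storage-only approach using plain Python's gzip instead of Crypto++ (file names are illustrative). A pruned model's long runs of zeros compress well, but the file must be decompressed back to its original size before TFLite can load it, so this helps storage and transfer only, not latency.

```python
# Storage-only compression of a .tflite file; the runtime still sees the
# original dense buffer once decompressed, so this does not improve latency.
import gzip
import shutil

with open("model.tflite", "rb") as src, gzip.open("model.tflite.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```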

gordinmitya avatar Jun 20 '20 06:06 gordinmitya

@gordinmitya Thanks, I hadn't heard of structural pruning; it seems like that could lead to smaller tflite binaries if it eliminates entire filters. Is structural pruning on the model-optimization roadmap?

Re: storage - I'm not worried about offline storage. I'm worried about latency & power usage during inference on tiny edge devices (probably MCUs). ARM is developing processors [1] that can do online decompression of weights on-the-fly during inference. This is interesting because now you can fit larger models in memory by utilizing their compression technique. If the model fits in memory (SRAM) you get lower latency & power usage. I'm wondering if the model-optimization & TFLite team are thinking about this or if it's outside their scope.

[1] https://www.theregister.com/2020/02/10/arm_cortex_m_ai_accelerator/ - "To fit this all into a small memory and silicon footprint, the microNPU can decompress trained INT8 models on the fly for inference."

shariq-audiofocus avatar Jun 22 '20 17:06 shariq-audiofocus

Structural pruning is really important to my team, too. The current zero-weight pruning for compression is nice but we're far more interested in reduced file sizes to be able to fit models into SRAM instead of DRAM.

I'm hopeful that this library will eventually support structural pruning, but so far I haven't seen any mention of it.

willbattel avatar Jun 23 '20 00:06 willbattel

Any updates on this? Can we expect latency improvements for our pruned models?

edumotya avatar Aug 26 '20 08:08 edumotya

Can you estimate a release date for the inference-time optimization?

pedroska777 avatar Aug 27 '20 22:08 pedroska777

Sorry for keeping you waiting. We're actively working on making the initial release of sparse inference support in TFLite. It's hard to give an exact date but hopefully before Q3 ends. Thanks for your patience!

liyunlu0618 avatar Aug 27 '20 22:08 liyunlu0618

A spoiler: https://github.com/tensorflow/model-optimization/blob/master/tensorflow_model_optimization/python/examples/sparsity/keras/mnist/mnist_e2e.py

Please note that we're still finalizing the API. The workflow in the released version may look different.
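For readers landing here before the docs, a rough sketch of what the block-sparse pruning setup in that example looks like with the tfmot Keras API (the toy model, sparsity target, and block shape below are illustrative, not the finalized workflow):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Toy model; the example linked above uses an MNIST network.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10),
])

pruning_params = {
    # Constant 80% sparsity; the block shape must match what the TFLite
    # sparse kernels expect (the block discussed in this thread is [4, 1]).
    "pruning_schedule": tfmot.sparsity.keras.ConstantSparsity(0.8, begin_step=0, frequency=100),
    "block_size": (4, 1),
}
pruned = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)
pruned.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# UpdatePruningStep is required so the pruning schedule advances during fit():
# pruned.fit(x_train, y_train, epochs=2,
#            callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# strip_pruning removes the pruning wrappers before converting to TFLite.
final_model = tfmot.sparsity.keras.strip_pruning(pruned)
```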

liyunlu0618 avatar Aug 27 '20 23:08 liyunlu0618

@liyunlu0618 I'm looking at your approach right now and trying to implement it. Does this latency-improved inference also work for Conv filters and not only Dense filters (how would one do it for Conv filters)? Also, why is the block exactly [4,1]? How does that ensure inference-time improvements? Thanks!

ghost avatar Aug 28 '20 16:08 ghost

For the Conv op we only support these hosted models at the moment: https://github.com/google-research/google-research/tree/master/fastconvnets

We need the block config to use SIMD instructions on the Arm Neon architecture. Feel free to check out the kernel here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/optimized/neon_tensor_utils.cc#L1962-L1990
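Purely as an illustration of why the block shape matters (NumPy only, not the actual NEON kernel): with 4-element blocks the zeros come in contiguous runs, so a vectorized kernel can check one block mask and skip a whole 4-wide vector at once instead of testing individual weights. The exact [4,1] vs [1,4] orientation depends on the kernel; here four consecutive weights within a row are grouped for simplicity.

```python
# Hypothetical demonstration of magnitude-based block pruning on a small
# weight matrix: whole 4-element blocks are either kept or zeroed, which is
# what lets a SIMD kernel skip entire blocks rather than single weights.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)

blocks = w.reshape(8, 4, 4)                   # rows x 4 blocks x 4 weights
block_score = np.abs(blocks).sum(axis=-1)     # one magnitude score per block
keep = block_score >= np.median(block_score)  # drop the lowest-scoring half
w_sparse = (blocks * keep[..., None]).reshape(8, 16)

print("fraction of zero weights:", np.mean(w_sparse == 0.0))
```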

liyunlu0618 avatar Aug 28 '20 19:08 liyunlu0618

Hi, are there updates on this?

js14083 avatar Dec 29 '20 12:12 js14083

@alanchiao any update on the progress?

dathudeptrai avatar Feb 24 '21 03:02 dathudeptrai

This is currently available as an experimental feature in TFLite.

For sparse CNNs, it needs to run with the XNNPack delegate. Please refer to this.

For sparse RNNs and transformers, TFLite has built-in support. This has a few examples.

We'll have formal blog posts/docs soon. In the meantime, if you could provide more details on your use case, I can suggest how to apply this optimization accordingly. Key points that are helpful:

  1. Model type and key operators
  2. Hardware backend you're targeting
  3. Whether to combine with quantization
  4. Target performance/accuracy numbers
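To make the CNN path above concrete, here is a hedged sketch of converting a stripped, pruned Keras model with the experimental sparsity flag (TF 2.x API names; `final_model` stands for the output of `strip_pruning`; on-device delegate wiring is omitted):

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
converter.optimizations = [
    tf.lite.Optimize.EXPERIMENTAL_SPARSITY,  # encode the pruned weights sparsely
    # tf.lite.Optimize.DEFAULT,              # optionally combine with quantization
]
tflite_model = converter.convert()

with open("pruned_model.tflite", "wb") as f:
    f.write(tflite_model)

# Smoke test on the host; on device, the XNNPack delegate (see the link above)
# is what dispatches supported float ops to sparse kernels.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
```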

liyunlu0618 avatar Feb 24 '21 18:02 liyunlu0618

@liyunlu0618 thanks for the information, I will play around with it a bit :D. Do you know when the documentation will be finished?

dathudeptrai avatar Feb 25 '21 06:02 dathudeptrai

mark

aa12356jm avatar Mar 01 '21 03:03 aa12356jm

Hello,

I was wondering if there is any intention of adding structural pruning support for conv layers (in addition to dense layers)? Is this possible to do, or does some fundamental issue prohibit it? Thanks

eejlny avatar Mar 01 '21 14:03 eejlny

@liyunlu0618 - My use case:

  1. Online, Streaming, Speech-Enhancement-like Task. Input Audio -> Dense -> LSTM -> Dense -> Output Audio. During training the Dense layers are actually CONV layers but I don't think that matters. Current model is ~8MB after int8 quantization, would like < ~4MB with sparsity/pruning features.
  2. Now: processor on an iPhone 11, or possibly edge TPU (Coral Dev Board). Later (2022): Syntiant's NDP120 or NDP500 chip [1].
  3. Yes need quantization + compression via pruning.
  4. Last time I checked quantization had minimal or no effect, 8dB -> 7.9dB. Hoping for similar results with 50% sparsity/structured pruning compression.

[1] https://www.syntiant.com/ndp120

shariq-audiofocus avatar Mar 29 '21 17:03 shariq-audiofocus

Any chance we will get support for pruned CNNs on other TFLite delegates? We rely on the NNAPI and CoreML delegates for quick and efficient inference on Android and iOS, respectively, but so far it looks like XNNPack is the only supported delegate.

willbattel avatar Jun 24 '21 20:06 willbattel

I have the same issue here. After pruning I get a model of the same size and the same inference time. Even after converting to TFLite it only runs on the CPU, so the inference time is still not good, and XNNPack does not support my network. Could you tell me what I can do next to improve the inference time with my pruned model? Thank you so much!

STAROFWIND avatar Mar 30 '22 05:03 STAROFWIND

Is there any update on this topic? What's the correct way to improve the inference time of a model with pruning?

zoythum avatar Nov 16 '22 15:11 zoythum

It seems there is still no proper solution for improving the inference time of a pruned model.

sampathrajapaksha avatar Mar 06 '23 08:03 sampathrajapaksha

@sampathrajapaksha - We've found the best approach is to do knowledge distillation (KD) to shrink your model and therefore improve inference time. This paper has some good ideas: https://arxiv.org/pdf/1910.01108.pdf and shows you can do it with minimal performance degradation. We're still experimenting, but this seems to be a better path forward than relying on pruning optimizations.
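For anyone curious what that looks like in practice, here is a minimal sketch of the standard distillation loss: the student is trained against the teacher's temperature-softened outputs plus the ground-truth labels. The temperature and weighting are illustrative; this is the general recipe, not the exact DistilBERT setup from the paper.

```python
import tensorflow as tf

def distillation_loss(labels, teacher_logits, student_logits,
                      temperature=4.0, alpha=0.5):
    # Soft-target term: cross-entropy of the student against the teacher's
    # temperature-softened distribution (equivalent to KL up to a constant).
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    log_soft_student = tf.nn.log_softmax(student_logits / temperature)
    soft_loss = -tf.reduce_mean(
        tf.reduce_sum(soft_teacher * log_soft_student, axis=-1)) * temperature ** 2

    # Hard-target term: ordinary cross-entropy on the ground-truth labels.
    hard_loss = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            labels, student_logits, from_logits=True))

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```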

shariq-audiofocus avatar Mar 06 '23 19:03 shariq-audiofocus

@shariq-audiofocus Thank you very much for sharing this with me. My use case is quite similar to yours. I'll read this and see how I can apply this to reduce inference time

sampathrajapaksha avatar Mar 06 '23 20:03 sampathrajapaksha