Fix initiation interval of pooling and zeropadding layers on Vitis backend
With the Vitis backend and io_stream, the zero-padding and pooling layers do not reach II=1 and are slower than, for example, the Conv layers.
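For context, the essence of the change is where the pipeline pragma sits in the streamed loop nest. Below is a minimal sketch of the idea for zero padding, not the actual diff: the loop structure, names, and `CONFIG_T` fields are simplified from the hls4ml io_stream templates and may not match the real code exactly (the `data_T` to `res_T` copy is assumed to convert directly).

```cpp
#include "hls_stream.h"

// Sketch of the main (non-border-row) region of a streamed ZeroPadding2D
// layer, channels-last. Top/bottom padding rows are omitted and the
// data_T -> res_T copy is simplified.
template <class data_T, class res_T, typename CONFIG_T>
void zeropad2d_main_rows_sketch(hls::stream<data_T> &data, hls::stream<res_T> &res) {
PadMainRows:
    for (unsigned i = 0; i < CONFIG_T::in_height; i++) {
        // The idea of the fix: pipeline the outer, per-row loop ...
        #pragma HLS PIPELINE

    PadLeftCols:
        for (unsigned j = 0; j < CONFIG_T::pad_left; j++) {
            res.write(res_T()); // zero word for the left border (real code zeroes each channel)
        }
    CopyMainCols:
        for (unsigned j = 0; j < CONFIG_T::in_width; j++) {
            // ... instead of (only) pipelining this inner copy loop, which on
            // Vitis did not reach II=1 across the whole row.
            res_T out = data.read();
            res.write(out);
        }
    PadRightCols:
        for (unsigned j = 0; j < CONFIG_T::pad_right; j++) {
            res.write(res_T()); // zero word for the right border
        }
    }
}
```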
## Type of change
- [x] Bug fix (non-breaking change that fixes an issue)
## Tests
Synthesized the zero-padding and pooling models from the pytests. The code achieves II=1, and the latency in cycles matches the trip count.
Input size = 128x128x3
C-Synthesis results with Vitis HLS 2023.2
| Layer | Latency (cycles) | FFs | LUTs |
|---|---|---|---|
| Zeropadding Before | 19487 | 169 | 596 |
| Zeropadding After | 17689 | 471 | 1675 |
| Pooling Before | 32769 | 764 | 1432 |
| Pooling After | 16387 | 795 | 1392 |
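As a quick cross-check of the trip-count claim (my arithmetic, assuming the stream loops process one word per pixel): the pooling layer reads every input word once, so its trip count is 128 × 128 = 16384; the "after" latency of 16387 is that plus a few cycles of pipeline fill (II ≈ 1), while the "before" latency of 32769 = 2 × 16384 + 1 corresponds to II ≈ 2. The zero-padding "after" latency of 17689 = 133 × 133 likewise looks like one stream write per pixel of the padded output.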
Also tested on a dummy CNN.
**Test Configuration**:
## Checklist
- [x] I have read the guidelines for contributing.
- [x] I have commented my code, particularly in hard-to-understand areas.
- [ ] I have made corresponding changes to the documentation.
- [x] My changes generate no new warnings.
- [x] I have installed and run `pre-commit` on the files I edited or added.
- [x] I have added tests that prove my fix is effective or that my feature works.
Is it important for the II to be 1? Generally, in io_stream, conv layers have a larger II. For zero-padding, at least, the utilization seems to go up.
@jmitrevs In the model that I am working with, we only use separable convolutions. If II=1 holds for the zero-padding and max-pooling, the depthwise and pointwise convolutions have the smaller latency (in cycles):
Depthwise-pointwise latency ≈ 512 × 512 = 262144 cycles (the image size) < zero-padding latency of 787473 cycles.
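In other words, 787473 ≈ 3 × 262144 = 786432: before the fix the zero-padding layer costs roughly three cycles per output pixel, while the depthwise-pointwise pair costs about one.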
Yes, this change allocates more resources, but since we are focusing on latency, padding and pooling seem to be the bottlenecks instead of the convolutions, which does not make much sense given that they don't perform such heavy computations.
I can take some more measurements to get a better grasp of how the resource utilization scales.
My experience with this is somewhat opposite:
- Pooling with `II=CONFIG_T::reuse_factor` results in worse latency, especially if the input is small, regardless of network architecture. For larger inputs I've seen it behave differently depending on the previous layers. I didn't manage to reproduce the issue with the input Stelios used, but I didn't test all configurations, just a single layer.
- For ZeroPadding, the change in fact shouldn't work, but it does. `II=1` is not achievable, since the inner loop will write one element to the stream in every iteration, and on Vivado it never results in a lower II. Vitis is different though: there it somehow manages to benefit from pipelining this loop (with or without `II=1` added). Must be some magic. But the mystery doesn't end there. If you pipeline the whole function (a bad idea, it really shouldn't work), it tries to do it, fails completely and says the function will not be pipelined, yet the result it produces is even better than pipelining the middle loop. What goes on in this tool is beyond my understanding. (The pragma placements being compared are sketched below.)
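To make the placements concrete, here is a hedged sketch of the three options on a simplified version of the padding loop nest (border-column loops are dropped and the names are only approximations of the real templates; the position of the pragma is the point):

```cpp
#include "hls_stream.h"

// Three pragma placements discussed above, marked (a)-(c). Only one would be
// active at a time; (b) is roughly what this PR does. Illustrative only.
template <class data_T, class res_T, typename CONFIG_T>
void zeropad2d_variants_sketch(hls::stream<data_T> &data, hls::stream<res_T> &res) {
    // (c) Pipeline the whole function: Vitis reports that the function will
    //     not be pipelined, yet the resulting schedule can still come out
    //     faster; synthesis time blows up for large inputs.
    // #pragma HLS PIPELINE

PadMainRows:
    for (unsigned i = 0; i < CONFIG_T::in_height; i++) {
        // (b) Pipeline the middle (per-row) loop: the reported II is not 1,
        //     but in practice Vitis schedules the rows noticeably better.
        #pragma HLS PIPELINE

    CopyMainCols:
        for (unsigned j = 0; j < CONFIG_T::in_width; j++) {
            // (a) Pipeline only this inner copy loop: each iteration writes a
            //     single word to the stream, so II=1 for the whole nest is not
            //     reachable this way, and it usually did not fix the II issue.
            // #pragma HLS PIPELINE II=1
            res_T out = data.read();
            res.write(out);
        }
    }
}
```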
> For ZeroPadding, the change in fact shouldn't work, but it does. `II=1` is not achievable, since the inner loop will write one element to the stream in every iteration.
Yes, it can't actually achieve II=1. After some testing I found that pipelining the external loop achieves better results than pipelining only the copy-main inner loop, which usually did not fix the II problem.
Pipelining the function for large inputs takes ages, so I avoided that entirely.