
Fix initiation interval of pooling and zeropadding layers on Vitis backend

Open · steltze opened this pull request 1 year ago · 4 comments

On the Vitis backend with io_stream, zero-padding and pooling layers don't reach II=1 and are slower than, for example, the Conv layers.
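As a rough illustration of the idea (this is not the hls4ml kernel; the 2x2/stride-2 max pooling, the fixed-point type, and the loop names below are made up for the example), pinning the per-element loop of a streaming layer with an explicit PIPELINE pragma is what lets it consume one input element per cycle:

```cpp
// Illustrative sketch only, not the hls4ml implementation: a streaming
// 2x2, stride-2 max pooling over a single channel, with the per-element
// loop explicitly pipelined so Vitis HLS targets II=1.
#include "ap_fixed.h"
#include "hls_stream.h"

typedef ap_fixed<16, 6> data_T; // assumed precision, for illustration

template <int H, int W> // H and W assumed even
void maxpool2x2_stream(hls::stream<data_T> &in, hls::stream<data_T> &out) {
    // Running maximum of each horizontal pair from the even row
    data_T pair_max[W / 2];

ReadRows:
    for (int row = 0; row < H; row++) {
    ReadCols:
        for (int col = 0; col < W; col++) {
            // One input element consumed per iteration; pipelining this
            // loop at II=1 makes the total latency ~H*W cycles.
            #pragma HLS PIPELINE II=1
            data_T v = in.read();
            int idx = col / 2;
            if (row % 2 == 0) {
                // Even row: remember the max of each horizontal pair
                pair_max[idx] = (col % 2 == 0) ? v : ((v > pair_max[idx]) ? v : pair_max[idx]);
            } else if (col % 2 == 0) {
                // Odd row, left element: fold into the stored pair max
                pair_max[idx] = (v > pair_max[idx]) ? v : pair_max[idx];
            } else {
                // Odd row, right element: emit the max of the 2x2 window
                out.write((v > pair_max[idx]) ? v : pair_max[idx]);
            }
        }
    }
}
```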

Type of change

  • [x] Bug fix (non-breaking change that fixes an issue)

Tests

Synthesized the zero-padding and pooling models from the pytests. The code achieves II=1, and the latency in cycles matches the trip count.

Input size = 128x128x3

C-Synthesis results with Vitis HLS 2023.2

| Layer | Latency (cycles) | FFs | LUTs |
|---|---|---|---|
| Zeropadding (before) | 19487 | 169 | 596 |
| Zeropadding (after) | 17689 | 471 | 1675 |
| Pooling (before) | 32769 | 764 | 1432 |
| Pooling (after) | 16387 | 795 | 1392 |

Also tested on a dummy CNN.

Test Configuration:

Checklist

  • [x] I have read the guidelines for contributing.
  • [x] I have commented my code, particularly in hard-to-understand areas.
  • [ ] I have made corresponding changes to the documentation.
  • [x] My changes generate no new warnings.
  • [x] I have installed and run pre-commit on the files I edited or added.
  • [x] I have added tests that prove my fix is effective or that my feature works.

steltze avatar Dec 04 '24 17:12 steltze

Is it important for the II to be 1? Generally, in io_stream, conv layers have a larger II. For zero-padding, at least, the utilization seems to go up.

jmitrevs avatar Jan 15 '25 00:01 jmitrevs

@jmitrevs In the model I am working with, we only use separable convolutions. If the II=1 for the zero-padding and max-pooling layers, the depthwise and pointwise convolutions have smaller latency (in cycles).

(attached image)

Depthwise-pointwise latency is ~512*512 = 262144 cycles (the image size), which is still lower than the zero-padding latency of 787473 cycles.
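For context, a bit of hedged arithmetic (my reading, not something stated in the thread): 3 × 262144 = 786432 ≈ 787473, so before the fix the zero-padding loop appears to run at an effective II of roughly 3 rather than 1.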

Yes, this change allocates more resources, but since we are focusing on latency, padding and pooling become the bottlenecks instead of the convolutions, which does not make much sense given that they don't perform such heavy computations.

I can take some more measurements to get a grasp on how the resource utilization scales.

steltze avatar Jan 15 '25 09:01 steltze

My experience with this is somewhat opposite:

  • Pooling with II=CONFIG_T::reuse_factor results in worse latency, especially if the input is small, regardless of network architecture. For larger inputs I've seen it behave differently depending on the previous layers. I didn't manage to reproduce the issue with the input Stelios used, but I didn't test all configurations, just a single layer.
  • For ZeroPadding, the change in fact shouldn't work, but it does. II=1 is not achievable, since the inner loop will write one element to the stream in every iteration. On Vivado it never results in a lower II. Vitis is different, though: there it somehow manages to benefit from pipelining this loop (with or without II=1 added). Must be some magic. The mystery doesn't end there, however. If you pipeline the whole function (a bad idea that really shouldn't work), the tool tries, fails completely, and reports that the function will not be pipelined, yet the result it produces is even better than pipelining the middle loop (see the loop sketch after this list). What goes on in this tool is beyond my understanding.
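For reference, a rough sketch of the loop nesting being discussed (an assumed shape, not the actual hls4ml zero-padding stream code; the names, type, single channel, and symmetric padding are made up), with the pragma placements from this thread marked:

```cpp
// Assumed loop structure for a streaming 2D zero-padding layer, single
// channel, symmetric padding of PAD on all sides. Not the hls4ml source.
#include "ap_fixed.h"
#include "hls_stream.h"

typedef ap_fixed<16, 6> data_T; // assumed precision, for illustration

template <int H, int W, int PAD>
void zeropad2d_stream(hls::stream<data_T> &in, hls::stream<data_T> &out) {
    // Placement A (whole function): a "#pragma HLS PIPELINE" here is the
    // variant described above as failing to pipeline, yet still producing
    // a surprisingly good schedule in Vitis.

PadTop: // PAD rows of zeros above the image
    for (int i = 0; i < PAD * (W + 2 * PAD); i++) {
        #pragma HLS PIPELINE II=1
        out.write(data_T(0));
    }

PadRows: // outer loop over the input rows
    for (int row = 0; row < H; row++) {
        // Placement B (outer row loop): the variant reported to give the
        // best results. A true II=1 is not reachable here, since each
        // iteration performs W + 2*PAD stream writes, but Vitis still
        // schedules the loop better than with placement C alone.
        #pragma HLS PIPELINE
    PadLeft:
        for (int j = 0; j < PAD; j++)
            out.write(data_T(0));
    CopyMain: // Placement C (inner copy loop): one stream write per
              // iteration; pipelining only this loop did not fix the
              // II problem on its own.
        for (int col = 0; col < W; col++)
            out.write(in.read());
    PadRight:
        for (int j = 0; j < PAD; j++)
            out.write(data_T(0));
    }

PadBottom: // PAD rows of zeros below the image
    for (int i = 0; i < PAD * (W + 2 * PAD); i++) {
        #pragma HLS PIPELINE II=1
        out.write(data_T(0));
    }
}
```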

vloncar avatar Feb 24 '25 23:02 vloncar

For ZeroPadding, the change in fact shouldn't work, but it does. II=1 is not achievable, since the inner loop will write one element to the stream in every iteration.

Yes, it can't actually achieve II=1. After some testing, I discovered that pipelining the outer loop achieves better results than pipelining only the copy-main inner loop, which usually did not fix the II problem.

Pipelining the function for large inputs takes ages, so I avoided that entirely.

steltze avatar Feb 27 '25 09:02 steltze