
Fix initiation interval of pooling and zeropadding layers on Vitis backend

Open · steltze opened this pull request 1 year ago · 4 comments

On the Vitis backend with io_stream, zero-padding and pooling layers don't reach II=1 and are slower than, for example, the Conv layers.
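As a rough illustration of the idea (this is not the hls4ml kernel; the 2x2/stride-2 max pooling, the fixed-point type, and the loop names below are made up for the example), pinning the per-element loop of a streaming layer with an explicit PIPELINE pragma is what lets it consume one input element per cycle:

```cpp
// Illustrative sketch only, not the hls4ml implementation: a streaming
// 2x2, stride-2 max pooling over a single channel, with the per-element
// loop explicitly pipelined so Vitis HLS targets II=1.
#include "ap_fixed.h"
#include "hls_stream.h"

typedef ap_fixed<16, 6> data_T; // assumed precision, for illustration

template <int H, int W> // H and W assumed even
void maxpool2x2_stream(hls::stream<data_T> &in, hls::stream<data_T> &out) {
    // Running maximum of each horizontal pair from the even row
    data_T pair_max[W / 2];

ReadRows:
    for (int row = 0; row < H; row++) {
    ReadCols:
        for (int col = 0; col < W; col++) {
            // One input element consumed per iteration; pipelining this
            // loop at II=1 makes the total latency ~H*W cycles.
            #pragma HLS PIPELINE II=1
            data_T v = in.read();
            int idx = col / 2;
            if (row % 2 == 0) {
                // Even row: remember the max of each horizontal pair
                pair_max[idx] = (col % 2 == 0) ? v : ((v > pair_max[idx]) ? v : pair_max[idx]);
            } else if (col % 2 == 0) {
                // Odd row, left element: fold into the stored pair max
                pair_max[idx] = (v > pair_max[idx]) ? v : pair_max[idx];
            } else {
                // Odd row, right element: emit the max of the 2x2 window
                out.write((v > pair_max[idx]) ? v : pair_max[idx]);
            }
        }
    }
}
```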

Type of change

  • [x] Bug fix (non-breaking change that fixes an issue)

Tests

Synthesized the zero-padding and pooling models from the pytests. The code achieves II=1, and the latency in cycles matches the trip count.

Input size = 128x128x3

C-Synthesis results with Vitis HLS 2023.2

| Layer | Latency (cycles) | FFs | LUTs |
|---|---|---|---|
| Zeropadding (before) | 19487 | 169 | 596 |
| Zeropadding (after) | 17689 | 471 | 1675 |
| Pooling (before) | 32769 | 764 | 1432 |
| Pooling (after) | 16387 | 795 | 1392 |

Also tested on a dummy CNN.

Test Configuration:

Checklist

  • [x] I have read the guidelines for contributing.
  • [x] I have commented my code, particularly in hard-to-understand areas.
  • [ ] I have made corresponding changes to the documentation.
  • [x] My changes generate no new warnings.
  • [x] I have installed and run pre-commit on the files I edited or added.
  • [x] I have added tests that prove my fix is effective or that my feature works.

steltze avatar Dec 04 '24 17:12 steltze

Is it important for the II to be 1? Generally, in io_stream, conv layers have a larger II. For zero-padding, at least, the utilization seems to go up.

jmitrevs avatar Jan 15 '25 00:01 jmitrevs

@jmitrevs In the model I am working with, we only use separable convolutions. If the II=1 for the zero-padding and max-pooling layers, the depthwise and pointwise convolutions have smaller latency (in cycles).

(attached image)

Depthwise-pointwise latency is ~512*512 = 262144 cycles (the image size), which is still lower than the zero-padding latency of 787473 cycles.
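For context, a bit of hedged arithmetic (my reading, not something stated in the thread): 3 × 262144 = 786432 ≈ 787473, so before the fix the zero-padding loop appears to run at an effective II of roughly 3 rather than 1.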

Yes, this change allocates more resources, but since we are focusing on latency, padding and pooling become the bottlenecks instead of the convolutions, which does not make much sense given that they don't perform such heavy computations.

I can take some more measurements to get a grasp on how the resource utilization scales.

steltze avatar Jan 15 '25 09:01 steltze

My experience with this is somewhat opposite:

  • Pooling with II=CONFIG_T::reuse_factor results in worse latency, especially if the input is small, regardless of network architecture. For larger inputs I've seen it behave differently depending on the previous layers. I didn't manage to reproduce the issue with the input Stelios used, but I didn't test all configurations, just a single layer.
  • For ZeroPadding, the change in fact shouldn't work, but it does. II=1 is not achievable, since the inner loop will write one element to the stream in every iteration. On Vivado it never results in a lower II. Vitis is different, though: there it somehow manages to benefit from pipelining this loop (with or without II=1 added). Must be some magic. The mystery doesn't end there, however. If you pipeline the whole function (a bad idea that really shouldn't work), the tool tries, fails completely, and reports that the function will not be pipelined, yet the result it produces is even better than pipelining the middle loop (see the loop sketch after this list). What goes on in this tool is beyond my understanding.
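For reference, a rough sketch of the loop nesting being discussed (an assumed shape, not the actual hls4ml zero-padding stream code; the names, type, single channel, and symmetric padding are made up), with the pragma placements from this thread marked:

```cpp
// Assumed loop structure for a streaming 2D zero-padding layer, single
// channel, symmetric padding of PAD on all sides. Not the hls4ml source.
#include "ap_fixed.h"
#include "hls_stream.h"

typedef ap_fixed<16, 6> data_T; // assumed precision, for illustration

template <int H, int W, int PAD>
void zeropad2d_stream(hls::stream<data_T> &in, hls::stream<data_T> &out) {
    // Placement A (whole function): a "#pragma HLS PIPELINE" here is the
    // variant described above as failing to pipeline, yet still producing
    // a surprisingly good schedule in Vitis.

PadTop: // PAD rows of zeros above the image
    for (int i = 0; i < PAD * (W + 2 * PAD); i++) {
        #pragma HLS PIPELINE II=1
        out.write(data_T(0));
    }

PadRows: // outer loop over the input rows
    for (int row = 0; row < H; row++) {
        // Placement B (outer row loop): the variant reported to give the
        // best results. A true II=1 is not reachable here, since each
        // iteration performs W + 2*PAD stream writes, but Vitis still
        // schedules the loop better than with placement C alone.
        #pragma HLS PIPELINE
    PadLeft:
        for (int j = 0; j < PAD; j++)
            out.write(data_T(0));
    CopyMain: // Placement C (inner copy loop): one stream write per
              // iteration; pipelining only this loop did not fix the
              // II problem on its own.
        for (int col = 0; col < W; col++)
            out.write(in.read());
    PadRight:
        for (int j = 0; j < PAD; j++)
            out.write(data_T(0));
    }

PadBottom: // PAD rows of zeros below the image
    for (int i = 0; i < PAD * (W + 2 * PAD); i++) {
        #pragma HLS PIPELINE II=1
        out.write(data_T(0));
    }
}
```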

vloncar avatar Feb 24 '25 23:02 vloncar

For ZeroPadding, the change in fact shouldn't work, but it does. II=1 is not achievable, since the inner loop will write one element to the stream in every iteration.

Yes, it can't actually achieve II=1. After some testing, I discovered that pipelining the outer loop achieves better results than pipelining only the copy-main inner loop, which usually did not fix the II problem.

Pipelining the function for large inputs takes ages, so I avoided that entirely.

steltze avatar Feb 27 '25 09:02 steltze