
Relu merge optimizer pass

Open oliviaweng opened this issue 2 years ago • 8 comments

Description

We introduce an hls4ml optimizer pass that merges a ReLU layer into the Dense/Conv2D layer it immediately follows, a frequently encountered pattern in neural networks (NNs). NNs in hls4ml are laid out spatially: each layer is implemented as a dataflow stage, and the stages are linked together by FIFOs that can cost BRAMs, LUTs, and/or FFs. By default in hls4ml, each ReLU is implemented as its own dataflow stage. Because every additional dataflow stage costs extra logic and FIFOs, we reduce resource utilization by merging the ReLU activation function into the layer preceding it. Although a layer with the newly merged ReLU functionality uses more logic than before, there is still a net decrease in resources. This optimization was introduced in hls4ml's MLPerf TinyML Benchmark 2022 submission and written up in this paper, which reports the resulting resource reductions.
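As a rough illustration of the idea (plain C++, not the actual hls4ml nnet templates, and with an assumed row-major weight layout), the merged kernel applies ReLU in the same loop that produces the dense output, so no separate activation stage or connecting FIFO is needed:

```cpp
#include <cstddef>

// Minimal sketch: a dense kernel with the ReLU fused into the output write,
// instead of streaming the result through a separate ReLU dataflow stage.
template<typename T, std::size_t N_IN, std::size_t N_OUT>
void dense_relu(const T data[N_IN], T res[N_OUT],
                const T weights[N_IN * N_OUT], const T biases[N_OUT]) {
    for (std::size_t j = 0; j < N_OUT; j++) {
        T acc = biases[j];
        for (std::size_t i = 0; i < N_IN; i++) {
            acc += data[i] * weights[i * N_OUT + j];
        }
        // Fused ReLU: clamp negatives here rather than in a downstream stage.
        res[j] = acc > T(0) ? acc : T(0);
    }
}
```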

Type of change

This optimization pass was first mentioned in the MLPerf TinyML PR #503.

  • [x] New feature (non-breaking change which adds functionality)
  • [x] A new research paper code implementation

Tests

This repo contains two test models (a fully-connected NN and a CNN) that can be trained on MNIST, converted to Vivado HLS projects, and synthesized with Vivado HLS 2020.1.

Checklist

  • [x] I have read the guidelines for contributing.
  • [x] I have commented my code, particularly in hard-to-understand areas.
  • [x] I have made corresponding changes to the documentation.
  • [x] My changes generate no new warnings.
  • [x] I have added tests that prove my fix is effective or that my feature works.

oliviaweng avatar Jun 29 '22 00:06 oliviaweng

Hi @oliviaweng thanks a lot for the contribution, it looks great!

It seems some tests are failing: https://gitlab.cern.ch/fastmachinelearning/hls4ml/-/pipelines/4158655

I think most of the failures are relatively easy to fix, e.g. CONFIG_T::out_t is missing. Could you take a look and see if you can make them pass? We can also discuss anything that is unclear about the errors.

jmduarte avatar Jun 29 '22 02:06 jmduarte

A question on the approach itself: this will only work for ReLU, and expanding it would require duplicate overrides for every possible combination of activation, layer, and HLS implementation. Instead of extending the kernel of the matrix-vector multiplication to tack on the ReLU computation and then creating duplicate function calls for it, why don't we introduce a new operation, a nop by default, that sits at the end of the dense function, basically where the cast function sits now? Then the config can include a proper implementation of the activation if we choose to merge it in. For example, in the config class we could add: template<class data_T, class CONFIG_T> using activation = nnet::some_activation_or_nop<data_T, CONFIG_T>;
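A minimal sketch of what such a hook could look like, assuming illustrative names (nnet::nop_activation, nnet::relu_activation, dense_config_example are placeholders, not existing hls4ml definitions):

```cpp
namespace nnet {

// Default: an identity "activation" that costs nothing after synthesis.
template<class data_T, class CONFIG_T>
struct nop_activation {
    static data_T apply(data_T x) { return x; }
};

// A ReLU the optimizer can plug in when it merges the activation into dense.
template<class data_T, class CONFIG_T>
struct relu_activation {
    static data_T apply(data_T x) { return x > data_T(0) ? x : data_T(0); }
};

} // namespace nnet

// In a generated dense config, the pass would select the activation:
struct dense_config_example {
    static const unsigned n_in = 16;
    static const unsigned n_out = 16;
    // No-op by default; swapped for relu_activation when ReLU is merged in.
    template<class data_T, class CONFIG_T>
    using activation = nnet::nop_activation<data_T, CONFIG_T>;
};

// Inside dense(), where the cast to the output type currently happens:
//   res[j] = CONFIG_T::template activation<accum_t, CONFIG_T>::apply(acc[j]);
```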

This probably sounds unclear, so if you have 15 minutes we can chat over zoom.

vloncar avatar Jul 05 '22 12:07 vloncar

Should we discuss @vloncar's suggestion? It seems like there is quite a bit of interest in this PR, so it would be good to get it in.

jmitrevs avatar Jul 29 '22 15:07 jmitrevs

It was discussed offline and we converged on the proper approach to this.

vloncar avatar Jul 29 '22 15:07 vloncar

Any more news on this?

jmitrevs avatar Jan 18 '23 22:01 jmitrevs

@abijithYayavaram is currently building a more generic version. We aim to push some updates within the next several weeks.

oliviaweng avatar Jan 23 '23 18:01 oliviaweng

What is the plan for this in general? Is this something we would tackle with the new code generation framework?

jmitrevs avatar Oct 20 '23 14:10 jmitrevs

I believe we should revisit this once we have a better way of generating functions in which activations would be merged. There's little gain in general if FIFO depth optimization is applied (and none for io_parallel).

vloncar avatar Oct 20 '23 14:10 vloncar