hls4ml icon indicating copy to clipboard operation
hls4ml copied to clipboard

Depthwise 1D and 2D Resource strategy for io_stream

Open steltze opened this issue 1 year ago • 4 comments

Description

  • Related to #859
  • [ ] Add a detailed description of the changes
  • Added const to the depth variable of the FIFOs between the Depthwise and the Pointwise convolutions to fix a Vitis HLS (2023.2) warning

Type of change

  • [x] New feature (non-breaking change which adds functionality)

Tests

  • Apart from the numerical correctness in the pytests comparing Keras to hls4ml, executed successfully the c-synthesis
  • Checked the numerical correctness and c-synthesis of a 10k parameter UNet that has 20 Separable 2D Convolutions

Checklist

  • [x] I have read the guidelines for contributing.
  • [ ] I have commented my code, particularly in hard-to-understand areas.
  • [ ] I have made corresponding changes to the documentation.
  • [x] My changes generate no new warnings.
  • [x] I have installed and run pre-commit on the files I edited or added.
  • [x] I have added tests that prove my fix is effective or that my feature works.

steltze avatar Oct 11 '24 15:10 steltze

Some results from vivado synthesis

steltze avatar Oct 11 '24 15:10 steltze

Thanks. I should mention that there is also an effort to develop depthwise (and seperable, which gets split) for oneAPI, which will replace the Quartus backend eventually. It's in a PR to the oneAPI PR.

jmitrevs avatar Oct 11 '24 15:10 jmitrevs

Updated to the latest main branch since I think it fixes the issue that caused the test failure.

jmitrevs avatar Oct 16 '24 19:10 jmitrevs

Pipelining over the kernel size might produce better QoR. Pending to test this hypothesis

steltze avatar Oct 17 '24 19:10 steltze

What is blocking this PR to be merged?

nghielme avatar Nov 21 '24 10:11 nghielme

@nghielme It is still marked as a draft, so no one took the time to thoroughly review it :wink:. Also, it will come after 1.0.0

vloncar avatar Nov 21 '24 14:11 vloncar

With this patch my Vivado HLS 2019.2 fails with following output, regardless of resource or latency mode:

ERROR: [HLS 200-70] '#pragma HLS ARRAY_PARTITION variable=&acc type=complete' is not a valid pragma.
ERROR: [HLS 200-70] '#pragma HLS ARRAY_RESHAPE variable=&weights type=block factor=block_factor' is not a valid pragma.
ERROR: [HLS 200-70] '#pragma HLS ARRAY_RESHAPE variable=&data type=block factor=block_factor' is not a valid pragma.
ERROR: [HLS 200-70] '#pragma HLS ARRAY_PARTITION variable=&acc type=complete' is not a valid pragma.
ERROR: [HLS 200-70] '#pragma HLS ARRAY_PARTITION variable=&acc type=complete' is not a valid pragma.

The issue can be solved by removing type= from these pragmas as done elsewhere in the code base.

xtreme8000 avatar Jan 01 '25 23:01 xtreme8000

thanks @xtreme8000! fixed

steltze avatar Jan 09 '25 15:01 steltze

The only thing I see lacking here is the use of function pointers for depnse instead of the current if (strategy == latency) do_latency_dense() else do_resource_dense() approach. We moved the rest of the codebase towards that model as it allows the dense function to be generated and implemented in special ways (HGQ and resource_unrolled use this). I've made the changes needded and put them in vloncar/stelios_depthwise_resource.

After extensive testing of both approaches, I've made a few observations:

  • In some rare instances function pointer adds a clock cycle to the main loop of depthwise function. I didn't manage to figure out why, the schedule viewer shows both approaches resulting in identical sequence of operations, but with function pointer one having one operation after depthwise_product function scheduled in the next cycle for no apparent reason. I'm thinking to follow this up with AMD.
  • Sometimes the 3rd case (rf > n_chan and rf % n_chan != 0) fails to meet timing. This is regardless of the approach used. We could consider enforcing RF that avoids it
  • Depending on the size of the layer/input, the pragma HLS INLINE recursive just prior to calling depthwise_resource may have a detrimental effect (going from added cycles to missing timing). Again, regardless of the approach. FWIW, this also affects regular Conv1D/2D.

Since the issues are specific to a use case, I'm thinking to merge the function pointer changes into this and then merge it to main. (there's also a trivial change to tests to avoid creating directory within a directory.) If the issue appears, it is trivial to manually tweak around it to avoid the use of function pointer and/or tweak the inline pragma. The 3rd case can also be avoided by the user if they observe that issue. If no one objects I'll do this tomorrow.

vloncar avatar Feb 24 '25 23:02 vloncar

This is now ready. Let's just wait to see if I messed up the tests before we merge.

vloncar avatar Feb 26 '25 20:02 vloncar