hls4ml icon indicating copy to clipboard operation
hls4ml copied to clipboard

oneAPI backend update: kernel and layer optimizations

Open jmitrevs opened this issue 9 months ago • 1 comments

Description

This is a replacement of #1218, moving the branch to the main repository for easier contribution by others.

Type of change

  • [x] Breaking change (fix or feature that would cause existing functionality to not work as expected)

This PR introduces improvements to the oneAPI inference backend, focusing on:

  • Utilizing sideband signals (sop and eop) in StreamingBeat for multi-kernel synchronization.
  • Refactoring core layers (Dense & ReLU) to employ always-run kernels and non-blocking I/O.
  • Introducing compile-time type extraction utilities for streamlined template handling.
  • Adding DMA-based data movement for generic execution.
  • Automated code generation.

Sideband Signal Support

  • Added start-of-packet (sop) and end-of-packet (eop) signals for kernel synchronization.
  • The following using-directive is generated per inter-kernel pipe and hostpipe. This ensures multiple kernels can operate in sync.
    using InputBeatT = sycl::ext::intel::experimental::StreamingBeat<
        data_T, // Data type
        true,    // Enable start-of-packet
        true>;   // Enable end-of-packet
    

Updated Dense and ReLU Layer for Always-Running Execution

  • Uses sop/eop sideband signals for synchronization.
  • Implements non-blocking reads for seamless streaming.
  • Utilizes while loop for always-on kernel execution.

Added DMA Kernels for Hardware Execution

  • DMA-based data movement for improved memory transfer:
    • DMA_convert_data and DMA_convert_data_back move data between host and FPGA efficiently.
    template <class srcType, class dest_pipe, size_t num_iterations> struct DMA_convert_data {};
    template <class src_pipe, class dstType, size_t num_iterations> struct DMA_convert_data_back {};
    
  • Modification to the way that testbench starts
    q.single_task(DMA_convert_data<float, Conv1DInputPipe, num_iterations>{vals_ptr});
    q.single_task(Myproject{});
    q.single_task(DMA_convert_data_back<Layer4OutPipe, float, num_iterations>{output_ptr}).wait();
    

Utility Functions for Compile-Time Type Extraction

  • Added helper structs to extract data types from pipes and StreamingBeat:

Tests

Tested the updated layers in emulation, simulation, and hardware run. Tests conducted by generating the project file using the oneAPI backend code generator, and compiling for the binary using cmake.

Test Configuration:

  • Configure the Quartus Prime Pro software with environment variables correctly setup (needed for simulation and bitstream generation.)
  • Configure the oneAPI environment with the extension Environment Configurator for oneAPI Toolkits.
  • Source the setvars script.

Checklist

  • [X] I have read the guidelines for contributing.
  • [X] I have commented my code, particularly in hard-to-understand areas.
  • [ ] I have made corresponding changes to the documentation.
  • [ ] My changes generate no new warnings.
  • [x] I have installed and run pre-commit on the files I edited or added.
  • [ ] I have added tests that prove my fix is effective or that my feature works.

jmitrevs avatar Mar 26 '25 19:03 jmitrevs

I noticed, by the way, that ReLU uses blocking reads, and all the components use blocking writes. Is there a requirement to use nonblocking reads and writes? Note, we do need to handle back-pressure, which is much more natural to do with blocking I/O.

jmitrevs avatar May 02 '25 21:05 jmitrevs