hls4ml
hls4ml copied to clipboard
oneAPI backend update: kernel and layer optimizations
Description
This is a replacement of #1218, moving the branch to the main repository for easier contribution by others.
Type of change
- [x] Breaking change (fix or feature that would cause existing functionality to not work as expected)
This PR introduces improvements to the oneAPI inference backend, focusing on:
- Utilizing sideband signals (sop and eop) in StreamingBeat for multi-kernel synchronization.
- Refactoring core layers (Dense & ReLU) to employ always-run kernels and non-blocking I/O.
- Introducing compile-time type extraction utilities for streamlined template handling.
- Adding DMA-based data movement for generic execution.
- Automated code generation.
Sideband Signal Support
- Added start-of-packet (sop) and end-of-packet (eop) signals for kernel synchronization.
- The following using-directive is generated per inter-kernel pipe and hostpipe. This ensures multiple kernels can operate in sync.
using InputBeatT = sycl::ext::intel::experimental::StreamingBeat< data_T, // Data type true, // Enable start-of-packet true>; // Enable end-of-packet
Updated Dense and ReLU Layer for Always-Running Execution
- Uses
sop/eopsideband signals for synchronization. - Implements non-blocking reads for seamless streaming.
- Utilizes
whileloop for always-on kernel execution.
Added DMA Kernels for Hardware Execution
- DMA-based data movement for improved memory transfer:
DMA_convert_dataandDMA_convert_data_backmove data between host and FPGA efficiently.
template <class srcType, class dest_pipe, size_t num_iterations> struct DMA_convert_data {}; template <class src_pipe, class dstType, size_t num_iterations> struct DMA_convert_data_back {}; - Modification to the way that testbench starts
q.single_task(DMA_convert_data<float, Conv1DInputPipe, num_iterations>{vals_ptr}); q.single_task(Myproject{}); q.single_task(DMA_convert_data_back<Layer4OutPipe, float, num_iterations>{output_ptr}).wait();
Utility Functions for Compile-Time Type Extraction
- Added helper structs to extract data types from pipes and StreamingBeat:
Tests
Tested the updated layers in emulation, simulation, and hardware run. Tests conducted by generating the project file using the oneAPI backend code generator, and compiling for the binary using cmake.
Test Configuration:
- Configure the Quartus Prime Pro software with environment variables correctly setup (needed for simulation and bitstream generation.)
- Configure the oneAPI environment with the extension Environment Configurator for oneAPI Toolkits.
- Source the
setvarsscript.
Checklist
- [X] I have read the guidelines for contributing.
- [X] I have commented my code, particularly in hard-to-understand areas.
- [ ] I have made corresponding changes to the documentation.
- [ ] My changes generate no new warnings.
- [x] I have installed and run
pre-commiton the files I edited or added. - [ ] I have added tests that prove my fix is effective or that my feature works.
I noticed, by the way, that ReLU uses blocking reads, and all the components use blocking writes. Is there a requirement to use nonblocking reads and writes? Note, we do need to handle back-pressure, which is much more natural to do with blocking I/O.