Vitis accelerator
## Description
The Vitis Accelerator Backend builds upon the foundation laid by the Vitis backend and streamlines the generation process for PCIe accelerators using the Vitis Accelerator Flow. Features:
- This backend inherits from the Vitis backend, ensuring compatibility with existing workflows and projects.
- Wraps the top-level design, converting between its AXI Stream input/output interfaces and the memory-mapped interfaces required by the accelerator platform.
- Automates the generation of host code and the necessary makefile for kernel compilation.
- Please note that the software and hardware emulation features are still a work in progress and will be added in subsequent commits.
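As an illustration, selecting this backend from Python could look roughly like the sketch below. The conversion call is shown as a comment since it requires an hls4ml + Vitis installation; the keyword names mirror the standard hls4ml converter API, and the specific values (and any accelerator-specific options) are assumptions, not defaults from this PR:

```python
# Minimal sketch of selecting the VitisAccelerator backend from Python.
# Keyword names follow the usual hls4ml converter API; the part/platform
# value below is hypothetical and depends on the target board.

convert_kwargs = {
    "backend": "VitisAccelerator",  # inherits from the Vitis backend
    "io_type": "io_stream",         # the configuration tested most by the authors
    "clock_period": 5,              # target clock period in ns
    "part": "xcu250-figd2104-2L-e", # hypothetical: board/platform dependent
}

# With hls4ml and Vitis installed, conversion would look roughly like:
# import hls4ml
# hls_model = hls4ml.converters.convert_from_keras_model(model, **convert_kwargs)
# hls_model.build()

print(sorted(convert_kwargs))
```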
## Type of change
- [ ] Bug fix (non-breaking change that fixes an issue)
- [ ] Documentation update
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] A new research paper code implementation
- [ ] Other (Specify)
## Tests
The backend has been tested with the hls4ml getting started tutorial example.
Test Configuration: Vitis 2022.2 was used for validation. The project's functionality was tested on a VCK5000 accelerator board.
## Checklist
- [x] I have read the guidelines for contributing.
- [x] I have commented my code, particularly in hard-to-understand areas.
- [x] I have made corresponding changes to the documentation.
- [x] My changes generate no new warnings.
- [x] I have installed and run `pre-commit` on the files I edited or added.
- [ ] I have added tests that prove my fix is effective or that my feature works.
As testing of this PR was mentioned in the minutes of the last dev meeting, the most recent work on the host code provided with the VitisAccelerator backend has been pushed, to ensure the latest version is tested (also rebased on current main).
I've done a first pass, going through the hls4ml-core changes. Most of the comments are minor, just to make sure the code is consistent with the rest of the hls4ml codebase. In the coming days, I'll also try out the VitisAccelerator backend on a local set-up with a U55C / U250 and review the accelerator-specific changes (templates, build files, etc.).
Overall, this is a very nice addition to the hls4ml codebase and seems quite orthogonal to all the other functionality, so there shouldn't be many issues with merging it soon.
Thanks for the review. There is probably some room for improvement, so please comment on your testing experience. We intend to do a polishing pass, mostly to provide a more seamless integration from the Python code, but maybe this can be done in a subsequent PR if the current PR is deemed usable enough.
I just tried testing the VitisAccelerator backend on an Alveo U55C and an Alveo U250, but ran into some issues:
- The biggest issue is timing violations: on both the U55C and the U250 there is a very large WNS, around -3 ns to -5 ns. I tried synthesising with clock periods of 4 ns and 5 ns, both with 27% uncertainty. I also tried lowering the batch size to 1 (hoping it would simplify the logic and reduce congestion). Finally, I tried both with and without hw_quant. In all of these configurations, and on both boards, there were significant timing violations, which is a bit unexpected. To me this looks like a missing constraint in the build process or similar. I've commonly seen timing violations on the U55C around the HBM, but they are usually much smaller (-0.5 ns) and can be fixed with some floor-planning and more advanced Vivado directives.
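As a rough back-of-envelope illustration (not from the PR itself): a negative WNS means the critical path overshoots the clock period by that amount, so the effectively achievable clock can be estimated as follows:

```python
# Back-of-envelope: with a negative WNS, the critical path needs
# period - WNS nanoseconds, which bounds the achievable frequency.

def effective_frequency_mhz(period_ns: float, wns_ns: float) -> float:
    """Achievable frequency given a target period and the reported WNS."""
    effective_period = period_ns - wns_ns  # e.g. 5 - (-3) = 8 ns
    return 1000.0 / effective_period       # ns -> MHz

# A 5 ns target (200 MHz) with a -3 ns WNS leaves roughly 125 MHz achievable:
print(round(effective_frequency_mhz(5.0, -3.0)))  # -> 125
```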
- I had to change the platform for the U250 to xilinx_u250_gen3x16_xdma_4_1_202210_1, because Vitis reported a "Platform not found" error. However, a quick Google search for the one in this PR, xilinx_u250_xdma_201830_2, does find it. I'm wondering whether there are several platform versions for the U250?
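For reference, XRT platform names encode the board together with the shell generation and release, which is why several platforms can exist for the same card. A small illustrative parse of the two names mentioned above (the field interpretation follows the common xilinx_<board>_<shell>_<release> naming pattern and is an assumption, not an official spec):

```python
# Illustrative only: split the two U250 platform names into vendor/board
# and shell-version components, showing that the board is the same but
# the shell releases differ.

def split_platform(name: str) -> dict:
    vendor, board, *rest = name.split("_")
    return {"vendor": vendor, "board": board, "shell": "_".join(rest)}

old = split_platform("xilinx_u250_xdma_201830_2")
new = split_platform("xilinx_u250_gen3x16_xdma_4_1_202210_1")

print(old["board"], new["board"])    # same board on both platforms
print(old["shell"] == new["shell"])  # different shell releases -> False
```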
- On the U250, after I changed the platform, the placer failed during implementation. There was a constraint (I guess generated by hls4ml) forcing the model kernel onto SLR0; however, this particular model couldn't fit into SLR0. The model was the jet tagging model, so not too large, but I think we should avoid such explicit placement of kernels onto SLRs, as it can be quite hard to estimate a model's resource usage before actual synthesis. In my opinion, per-SLR placement should be left to more advanced users who have trouble meeting timing.
So, in response to the above comment: the significant timing issues occur only with io_parallel; io_stream has no such problems.
Thanks again for taking the time to test this!
Yes, timing closure is very design-dependent and is generally expected to be handled by the model creator. That said, you raise a good point: we mostly tested with io_stream, so we didn’t encounter this kind of issue. Since io_stream is typically the preferred option for large models in acceleration contexts, this choice made sense for our use case. However, it does make quick evaluations using io_parallel less effective (and this might be a use case for this backend). Perhaps an io_parallel-optimized version in the HLS wrapper could help address this.
Regarding the platform: yes, there are multiple platform versions (think fpga shell versions) for each board. Rather than trying to cover all cases, our goal is to offer an easy way for users to switch between them, while providing sensible defaults, though these may change over time. We should make this clearer in the documentation, or at least refer to AMD documentation about XRT platforms.
You’re also right about constraints: we shouldn't provide any by default. It's better to let users add them as needed for their specific designs. In the same spirit, we've also removed the explicit buffer object memory associations.
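For users who do want explicit SLR placement, it can be supplied through a user-provided v++ configuration file rather than a generated default. A sketch along these lines (the kernel instance name is illustrative):

```ini
; User-supplied v++ config file, passed with: v++ --config ./slr.cfg
; The kernel instance name below is illustrative.
[connectivity]
slr=myproject_kernel_1:SLR0
```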
We’ll be fixing the constraint handling and updating the platform documentation and defaults soon. Updating the wrapper to better support io_parallel might take a bit longer, so that could come in a future PR.
Can you fix the pre-commit issues?
Considering the overlap between this PR and the new VitisUnified backend PR (#1376), the developers agreed in today’s meeting to merge the two efforts so that both SoC and accelerator support are included before merging into main. As this may take some time, this PR will be periodically rebased and maintained until an equivalent solution is merged.