scalehls
Support operation chaining for more accurate estimation
Assuming addi and addi can be chained together, a possible approach is making the following transformation:
%0 = addi %arg0, %arg1 : index
%1 = addi %0, %arg2 : index
to
%0:2 = hlscpp.op_chain {
  %0 = addi %arg0, %arg1 : index
  %1 = addi %0, %arg2 : index
  hlscpp.yield %0, %1 : index, index
}
During the estimation, the hlscpp.op_chain can be considered as a single operation which takes 1 clock cycle. However, this approach has some downsides (e.g., it may make the memory access analysis more complicated). Therefore, leaving the IR unchanged and doing the op chaining analysis only during the estimation is also possible.
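To make the second option more concrete, here is a minimal sketch of what a chaining-aware latency estimate could look like when the analysis is done purely inside the estimator. It is not ScaleHLS code: the `Op` class, the `estimate_cycles` function, and the per-operation delays are all hypothetical, and it assumes every operation is combinational with a delay no larger than the clock period.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    delay_ns: float                                # assumed combinational delay
    operands: list = field(default_factory=list)   # names of producer ops


def estimate_cycles(ops, clock_ns):
    """Estimate latency in cycles for ops given in topological order.

    Dependent ops are chained into the same cycle as long as their
    accumulated delay still fits into the clock period.
    """
    finish = {}  # op name -> (cycle index, time within that cycle when ready)
    for op in ops:
        # The op can start once its latest operand is ready.
        start_cycle, start_time = 0, 0.0
        for src in op.operands:
            if finish[src] > (start_cycle, start_time):
                start_cycle, start_time = finish[src]
        # If the op does not fit into the rest of the cycle, register the
        # operand and start the op at the beginning of the next cycle.
        if start_time + op.delay_ns > clock_ns:
            start_cycle, start_time = start_cycle + 1, 0.0
        finish[op.name] = (start_cycle, start_time + op.delay_ns)
    if not finish:
        return 0
    return max(cycle for cycle, _ in finish.values()) + 1


# The two chained addi ops from above, with a made-up delay of 2 ns each.
chain = [Op("add0", 2.0), Op("add1", 2.0, ["add0"])]
```

With these made-up delays, `estimate_cycles(chain, clock_ns=10.0)` places both additions in one cycle, while `estimate_cycles(chain, clock_ns=3.0)` needs two. The IR stays untouched, so the memory access analysis is not affected.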
Excellent work! It could be very practical for HLS applications! I do hope this project can benefit more designers!
I went through a similar development flow for an HLS compiler two years ago. I hope that I can help, but currently I am dealing with an FPGA placement problem. Therefore, I will try to recall some things that I think are important and that tortured me previously, which may be useful for you:
For the operation chaining
Vivado HLS might use:
- Ternary Adder for 3-operand addition (TAddSub)
- DSP for MAC operation when bitwidth is lower than a threshold (e.g. 18-bit)
- DSP for Addition-Multiplication-Addition when some of the operations are constant and bitwidth is lower than a threshold (e.g. 18-bit)
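As a rough illustration of how an estimator might recognize the first two patterns, here is a small sketch. Everything in it is an assumption made for the example: the tuple-based expression format, the `classify` function, the returned labels, and the use of `<=` for the bitwidth threshold. It does not describe how Vivado HLS actually makes this decision.

```python
def classify(expr, bitwidth, dsp_threshold=18):
    """Classify an expression tree for resource estimation.

    expr is either a leaf name (str) or a tuple (op, lhs, rhs).
    The returned labels are hypothetical names, not tool output.
    """
    if isinstance(expr, str) or expr[0] != "add":
        return "plain"
    # Collect the opcodes of the operands that are themselves operations.
    operand_ops = [e[0] for e in expr[1:] if isinstance(e, tuple)]
    if "mul" in operand_ops and bitwidth <= dsp_threshold:
        return "dsp_mac"        # multiply-accumulate mapped to one DSP
    if "add" in operand_ops:
        return "ternary_adder"  # 3-operand addition (TAddSub)
    return "plain"
```

For example, `classify(("add", ("add", "a", "b"), "c"), 32)` gives `"ternary_adder"`, and `classify(("add", ("mul", "a", "b"), "c"), 16)` gives `"dsp_mac"`, while the same multiply-add at 32 bits falls back to `"plain"`.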
For the memory access
- Address calculation (mul, add, and even urem) sometimes costs a high proportion of resources when the loop unrolling factor is small or imperfect due to resource constraints.
- The BRAM resource for arrays with large bitwidth should be calculated carefully, since BRAM IP cores only support some specific bitwidths, e.g. 1, 2, 4, 9, 18, 36.
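The BRAM point can be sketched as follows. This assumes the commonly documented aspect ratios of one 18Kb block RAM (16K x 1 up to 512 x 36) and simply picks the configuration that needs the fewest blocks. The names `BRAM18K_CONFIGS` and `bram18k_count` are made up for this example, and a real tool may pack arrays differently or map small arrays to LUTRAM.

```python
import math

# (data width, depth) aspect ratios assumed for a single 18Kb block RAM.
BRAM18K_CONFIGS = [(1, 16384), (2, 8192), (4, 4096),
                   (9, 2048), (18, 1024), (36, 512)]


def bram18k_count(bitwidth, depth):
    """Estimate how many 18Kb BRAMs an array of depth x bitwidth needs."""
    best = None
    for width, rows in BRAM18K_CONFIGS:
        # Blocks side by side to cover the word width, stacked to cover depth.
        count = math.ceil(bitwidth / width) * math.ceil(depth / rows)
        best = count if best is None else min(best, count)
    return best
```

For a 37-bit array with 256 entries, `bram18k_count(37, 256)` returns 2, although the array only holds 9472 bits and would fit into a single 18Kb block if only the total bit count were considered.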
Resource reusing
- Floating-point operators and fixed-point multipliers are frequently reused when there are multiple loops in a function without dataflow.
Function instantiation
- A function might lead to multiple different instances when its input arrays are different. But if the instances are identical, they might be mapped to one resource module.
Instruction Optimization
- When there is a large number of dependent operations, instructions will be reordered as long as this does not change the arithmetic result. (e.g., floating-point operators will not be reordered.)
- Instruction hoisting from branches is applied when the branches are relatively "balanced" from the perspective of resource and computation.
Dataflow with Branches or Bypassing
- It would be better to check the relationship between functions/loops. If some of them are not highly coupled, they could be generated as independent "top functions" and interconnected via Tcl commands in Vivado, which can overcome the limitations of Vivado HLS.