
Support operation chaining for more accurate estimation

Open hanchenye opened this issue 3 years ago • 2 comments

hanchenye, Oct 27 '20 23:10

Assuming two consecutive addi operations can be chained together, one possible approach is the following transformation:

%0 = addi %arg0, %arg1 : index
%1 = addi %0, %arg2 : index

to

%0:2 = hlscpp.op_chain {
  %0 = addi %arg0, %arg1 : index
  %1 = addi %0, %arg2 : index
  hlscpp.yield %0, %1 : index, index
}

During estimation, the hlscpp.op_chain can then be treated as a single operation that takes 1 clock cycle. However, this approach has some downsides (e.g., it may make the memory access analysis more complicated). An alternative is to perform the op chaining analysis only during estimation, without rewriting the IR.
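To make the estimation-only alternative concrete, here is a minimal sketch in Python of an ASAP schedule that chains dependent operations into one cycle while their accumulated delay still fits the clock period. The `Op` class, the function names, and the 2.0 ns delay for addi are all assumptions made for illustration; none of them come from ScaleHLS.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    delay_ns: float        # assumed combinational delay, not a measured value
    operands: tuple = ()   # names of producer ops (function arguments omitted)


def schedule_with_chaining(ops, clock_ns):
    """ASAP-schedule ops (given in topological order) with chaining.

    Returns {op name: (cycle, finish offset within that cycle in ns)}.
    Assumes every op's own delay fits within one clock period.
    """
    schedule = {}
    for op in ops:
        # Start when the latest operand is ready (cycle first, then offset).
        cycle, offset = max((schedule[src] for src in op.operands),
                            default=(0, 0.0))
        # If the op no longer fits in the current cycle, register the
        # operands and start at the beginning of the next cycle.
        if offset + op.delay_ns > clock_ns:
            cycle, offset = cycle + 1, 0.0
        schedule[op.name] = (cycle, offset + op.delay_ns)
    return schedule


def latency_cycles(schedule):
    """Number of cycles spanned by the schedule."""
    return max(cycle for cycle, _ in schedule.values()) + 1


ops = [
    Op("%0", 2.0),            # %0 = addi %arg0, %arg1
    Op("%1", 2.0, ("%0",)),   # %1 = addi %0, %arg2
]
chained = latency_cycles(schedule_with_chaining(ops, clock_ns=10.0))
unchained = latency_cycles(schedule_with_chaining(ops, clock_ns=3.0))
```

With the assumed delays, both additions fit into a single 10 ns cycle, while a 3 ns clock forces the second addition into the next cycle. Because the IR itself is left untouched, the memory access analysis is not affected.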

hanchenye, Dec 30 '20 21:12

Excellent work!!!! It could be very practical for HLS applications!!! I do hope this project can benefit more designers!!

I went through a similar development flow for an HLS compiler two years ago. I would like to help, but I am currently working on an FPGA placement problem. So I will try to recall some things that I think are important, and that tortured me previously, which may be useful for you:

For the operation chaining

Vivado HLS might use:

  1. Ternary Adder for 3-operand addition (TAddSub)
  2. DSP for MAC operation when bitwidth is lower than a threshold (e.g. 18-bit)
  3. DSP for Addition-Multiplication-Addition when some of the operations are constant and bitwidth is lower than a threshold (e.g. 18-bit)
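As a rough illustration of how an estimator could recognize the first two patterns, here is a small Python sketch that walks a list of operations and pairs an add with a single-use producer. The `Op` record, the `fuse` function, and the labels `ternary_add` and `dsp_mac` are hypothetical; the 18-bit threshold is only the example value mentioned above, and the third pattern (with constant operands) is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    result: str
    kind: str          # "add" or "mul"
    operands: tuple    # names of the values this op reads
    bitwidth: int


def fuse(ops, dsp_max_bits=18):
    """Return (pattern, producer, consumer) triples for fusable op pairs.

    ops must be in topological order. A producer is only fused when its
    result has exactly one use, so no intermediate value is lost.
    """
    defs = {op.result: op for op in ops}
    uses = {}
    for op in ops:
        for value in op.operands:
            uses[value] = uses.get(value, 0) + 1

    fused, absorbed = [], set()
    for op in ops:
        if op.kind != "add" or op.result in absorbed:
            continue
        for value in op.operands:
            src = defs.get(value)
            if src is None or uses[value] != 1 or src.result in absorbed:
                continue
            if src.kind == "add":
                pattern = "ternary_add"   # 3-operand addition
            elif src.kind == "mul" and max(src.bitwidth, op.bitwidth) <= dsp_max_bits:
                pattern = "dsp_mac"       # multiply-accumulate on a DSP
            else:
                continue
            fused.append((pattern, src.result, op.result))
            absorbed.update((src.result, op.result))
            break
    return fused


ops = [
    Op("%0", "add", ("%a", "%b"), 32),
    Op("%1", "add", ("%0", "%c"), 32),   # add of add
    Op("%2", "mul", ("%x", "%y"), 16),
    Op("%3", "add", ("%2", "%z"), 16),   # narrow mul + add
    Op("%4", "mul", ("%p", "%q"), 32),
    Op("%5", "add", ("%4", "%w"), 32),   # too wide for the assumed threshold
]
patterns = fuse(ops)
```

Here `%0`/`%1` are reported as a ternary add and `%2`/`%3` as a DSP MAC, while the 32-bit multiply-add pair is left alone. Whether the tool really applies a pattern in a given design depends on more than this sketch checks.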

For the memory access

  1. Address calculation (mul, add, and even urem) can sometimes consume a large share of the resources when the loop unrolling factor is small or imperfect due to resource constraints.
  2. BRAM usage for arrays with a large bitwidth should be calculated carefully, since BRAM IP cores only support specific bitwidths, e.g. 1, 2, 4, 9, 18, 36, ...
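The bitwidth point can be sketched as follows in Python. The (depth, width) table is my assumption of the aspect ratios of one 18 Kb block RAM, chosen to match the widths listed above; the function names are hypothetical, and real tools may mix configurations or map small arrays to LUTRAM instead.

```python
from math import ceil

# Assumed (depth, width) aspect ratios of a single 18 Kb block RAM.
BRAM18K_CONFIGS = [
    (16384, 1), (8192, 2), (4096, 4), (2048, 9), (1024, 18), (512, 36),
]


def bram18k_count(depth, width):
    """Blocks needed when every block uses the same aspect ratio."""
    return min(ceil(depth / d) * ceil(width / w) for d, w in BRAM18K_CONFIGS)


def naive_count(depth, width):
    """Blocks needed if only the total number of bits mattered."""
    return ceil(depth * width / (18 * 1024))


# A 256 x 37-bit array holds fewer bits than one block, but 37 bits do not
# fit the widest 36-bit configuration, so a second block is needed.
aware = bram18k_count(256, 37)
naive = naive_count(256, 37)
```

Under these assumptions, counting bits alone gives one block for the 256 x 37-bit array, while the aspect-ratio model gives two.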

Resource reusing

  1. Floating-point operators and fixed-point multipliers are frequently reused when there are multiple loops in a function without dataflow.
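For an estimator, one simple way to reflect this is to take the per-operator maximum over the loops instead of the sum. The Python sketch below assumes the loops run strictly one after another and that the listed operators are always shared, which the tool does not guarantee; the function names and operator counts are made up for illustration.

```python
from collections import Counter


def shared_operator_count(loops):
    """Per-operator maximum across sequential loops (operators reused)."""
    shared = Counter()
    for usage in loops:
        for op, count in usage.items():
            shared[op] = max(shared[op], count)
    return dict(shared)


def summed_operator_count(loops):
    """Per-operator sum across loops (no reuse at all)."""
    total = Counter()
    for usage in loops:
        total.update(usage)
    return dict(total)


# Two loops in one function without dataflow, with made-up operator counts.
loops = [{"fmul": 4, "fadd": 2}, {"fmul": 2, "fadd": 3}]
shared = shared_operator_count(loops)
summed = summed_operator_count(loops)
```

The two results bound the estimate: the sum is the no-sharing upper bound, and the maximum is what full reuse would give.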

Function instantiation

A function might lead to multiple different instances when its input arrays are different. But if the instances are identical, they might be mapped to one resource module.

Instruction Optimization

  1. When there is a large number of dependent operations, instructions are reordered as long as this does not cause arithmetic problems (e.g., floating-point operations are not reordered).
  2. Instruction hoisting out of branches is applied when the branches are relatively "balanced" in terms of resources and computation.

Dataflow with Branches or Bypassing

It would be better to check the relationship between functions/loops. If some of them are not highly coupled, they could be generated as independent "top functions" and interconnected via Tcl commands in Vivado, which can overcome the limitations of Vivado HLS.

zslwyuan, Aug 02 '21 16:08