WIP: Fix cross dependent path
- [x] check ComputeGraph: setup_internal_buffers
- [x] check KernelChainGraph: compute_kernel_latency
- [x] check KernelChainGraph: compute_delay_buffer
- [x] apply persistent fix & remove manual fix for horidiff
@definelicht
Are the op_latency values in stencilflow/compute_graph.config accurate, or do we have to make them dynamic, i.e. a function of the device type or IP block we invoke?
Furthermore, we model the pipeline latency as 1 elem/cycle (push/pop). Is this accurate in all cases?
In practice they are dynamic, but I think our current approach of just estimating them conservatively is "good enough". The true buffer space consumption comes from having to buffer large delays, which will always be orders of magnitude bigger than a conservative estimate of the operation latency. We could even just estimate any arithmetic operation on floating point numbers to take 16 cycles and be done with it :-)
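For illustration, a conservative static table along those lines could look like the sketch below. The dictionary and helper are assumptions for the sake of the example, not the actual format of stencilflow/compute_graph.config.

```python
# Illustrative only: a conservative, device-independent latency table.
# These numbers and the helper are assumptions, not the real config schema.
CONSERVATIVE_OP_LATENCY = {
    "add": 16,
    "sub": 16,
    "mult": 16,
    "div": 16,
}

def op_latency(op: str) -> int:
    """Return a conservative latency estimate in cycles for an operation."""
    # Fall back to 16 cycles for any floating point operation not listed.
    return CONSERVATIVE_OP_LATENCY.get(op, 16)
```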
Well, that might be an issue, since we have several paths. Setting the arithmetic operation latency too high could create interlocks, because the buffer of the other path would then be over-estimated (i.e. its delay buffer size is too large).
What did you want me to review here? I could only find some debugging code
Sorry about that, it is still WIP. I pushed the review button by mistake and removed it immediately, but I guess the email notification had already been sent out.
> Well, that might be an issue, since we have several paths. Setting the arithmetic operation latency too high could create interlocks, because the buffer of the other path would then be over-estimated (i.e. its delay buffer size is too large).
I don't think I understand why this is an issue. As long as we use FIFOs (instead of, say, fixed depth shift registers), isn't it fine if they're too big? Then they will just never use their full capacity.
@definelicht FYI: the dace submodule link from master is invalid (i.e. 404).
Furthermore, we have mpi4py as a Python dependency, but do not check whether mpi-dev is installed (i.e. for me, mpi.h is missing on fpga1).
This is okay for me, but perhaps we can make it optional so it only fails if trying to run something distributed?
Yep, that makes sense
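For example, the import could be deferred until a distributed run is actually requested, so only that path fails when MPI is missing. This is a sketch with a hypothetical helper name, not existing StencilFlow code.

```python
def get_mpi_communicator():
    """Import mpi4py lazily so non-distributed runs work without MPI.

    Only a distributed run fails if mpi4py (or the MPI development
    headers providing mpi.h) is missing.
    """
    try:
        from mpi4py import MPI  # requires a working MPI installation
    except ImportError as err:
        raise RuntimeError(
            "Distributed execution requested, but mpi4py is not available. "
            "Install the MPI development package (providing mpi.h) and "
            "mpi4py to run distributed programs.") from err
    return MPI.COMM_WORLD
```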
> I don't think I understand why this is an issue. As long as we use FIFOs (instead of, say, fixed depth shift registers), isn't it fine if they're too big? Then they will just never use their full capacity.
To my understanding, the example below might stall in both cases, i.e. both if the delay buffer is too small and if it is too big, right?

The size of the delay buffer should have no influence on when the data arrives from c to k3. These are just FIFOs, so they can be read early. k3 will start consuming as soon as it has data available on both inputs regardless of the size of the delay buffer. How did you calculate the 2048 cycles arrival at k3?
OK, yes, you are right, but we are basically wasting memory.
Yep, this is the "safe" solution. Keep in mind that the buffer sizes from this will be tiny compared to buffering a slice of the 2D domain, so I'm not concerned about this being a significant factor :-)
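To make the timing argument concrete, here is a toy cycle model (not the StencilFlow simulator; all names and numbers are illustrative). A kernel joining a direct path and a delayed path fires as soon as both FIFOs hold data, so its first firing depends on the path delay, not on the FIFO capacity; over-provisioning only costs memory.

```python
from collections import deque

def simulate(delay_cycles: int, fifo_capacity: int, num_cycles: int) -> int:
    """Toy model: cycle at which a two-input joining kernel first fires."""
    direct, delayed = deque(), deque()
    first_fire = None
    for cycle in range(num_cycles):
        # Producers push one element per cycle if there is space.
        if len(direct) < fifo_capacity:
            direct.append(cycle)
        if cycle >= delay_cycles and len(delayed) < fifo_capacity:
            delayed.append(cycle - delay_cycles)
        # The consumer pops as soon as both inputs have data available.
        if direct and delayed:
            direct.popleft()
            delayed.popleft()
            if first_fire is None:
                first_fire = cycle
    return first_fire

# The first firing cycle depends only on the delay, not on the capacity:
assert simulate(delay_cycles=8, fifo_capacity=16, num_cycles=32) == \
       simulate(delay_cycles=8, fifo_capacity=1024, num_cycles=32)
```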
Compiling reference SDFG...
Loading input arrays...
Initializing output arrays...
Executing DaCe program...
Finished running program.
Executing reference DaCe program...
Finished running program.
Results saved to results/horidiff
Reference results saved to results/horidiff/reference
Comparing to reference SDFG...
Results verified.
bin/run_program.py test/stencils/horidiff.json emulation -compare-to-referenc 47.84s user 11.63s system 214% cpu 27.689 total
It seems to be fixed, please reproduce this finding before we merge :)
This was with a minimum channel depth of 1024.