WIP: Fix cross dependent path
- [x] check ComputeGraph: setup_internal_buffers
- [x] check KernelChainGraph: compute_kernel_latency
- [x] check KernelChainGraph: compute_delay_buffer
- [x] apply persistent fix & remove manual fix for horidiff
@definelicht
Are the op_latency values in stencilflow/compute_graph.config accurate, or do we have to make them dynamic, i.e. a function of the device type or IP block we invoke?
Furthermore, we model the pipeline latency as 1 elem/cycle (push/pop). Is this accurate in all cases?
In practice they are dynamic, but I think our current approach of just estimating them conservatively is "good enough". The true buffer space consumption comes from having to buffer large delays, which will always be orders of magnitude bigger than a conservative estimate of the operation latency. We could even just estimate any arithmetic operation on floating point numbers to take 16 cycles and be done with it :-)
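For illustration, a conservative static table along those lines could look like the sketch below. The dictionary and helper are assumptions for the sake of the example, not the actual format of stencilflow/compute_graph.config.

```python
# Illustrative only: a conservative, device-independent latency table.
# These numbers and the helper are assumptions, not the real config schema.
CONSERVATIVE_OP_LATENCY = {
    "add": 16,
    "sub": 16,
    "mult": 16,
    "div": 16,
}

def op_latency(op: str) -> int:
    """Return a conservative latency estimate in cycles for an operation."""
    # Fall back to 16 cycles for any floating point operation not listed.
    return CONSERVATIVE_OP_LATENCY.get(op, 16)
```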
Well, that might be an issue, since we have several paths. Setting the arithmetic operation latency too high could create interlocks, because the buffer of the other path would then be over-estimated (i.e. its delay buffer size is too large).
What did you want me to review here? I could only find some debugging code
Sorry about that, it is still WIP. I pushed the review button by mistake and removed it immediately, but I guess the email notification had already been sent out.
> Well, that might be an issue, since we have several paths. Setting the arithmetic operation latency too high could create interlocks, because the buffer of the other path would then be over-estimated (i.e. its delay buffer size is too large).
I don't think I understand why this is an issue. As long as we use FIFOs (instead of, say, fixed depth shift registers), isn't it fine if they're too big? Then they will just never use their full capacity.
@definelicht FYI: the dace submodule link from master is invalid (i.e. 404).
Furthermore, we have mpi4py as a Python dependency, but do not check whether mpi-dev is installed (i.e. for me, mpi.h is missing on fpga1).
This is okay for me, but perhaps we can make it optional so it only fails if trying to run something distributed?
Yep, that makes sense
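For example, the import could be deferred until a distributed run is actually requested, so only that path fails when MPI is missing. This is a sketch with a hypothetical helper name, not existing StencilFlow code.

```python
def get_mpi_communicator():
    """Import mpi4py lazily so non-distributed runs work without MPI.

    Only a distributed run fails if mpi4py (or the MPI development
    headers providing mpi.h) is missing.
    """
    try:
        from mpi4py import MPI  # requires a working MPI installation
    except ImportError as err:
        raise RuntimeError(
            "Distributed execution requested, but mpi4py is not available. "
            "Install the MPI development package (providing mpi.h) and "
            "mpi4py to run distributed programs.") from err
    return MPI.COMM_WORLD
```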
> I don't think I understand why this is an issue. As long as we use FIFOs (instead of, say, fixed depth shift registers), isn't it fine if they're too big? Then they will just never use their full capacity.
To my understanding, the example below might stall in both cases, i.e. both if the delay buffer is too small and if it is too big, right?

The size of the delay buffer should have no influence on when the data arrives from c to k3. These are just FIFOs, so they can be read early. k3 will start consuming as soon as it has data available on both inputs regardless of the size of the delay buffer. How did you calculate the 2048 cycles arrival at k3?
OK, yes, you are right, but we are basically wasting memory.
Yep, this is the "safe" solution. Keep in mind that the buffer sizes from this will be tiny compared to buffering a slice of the 2D domain, so I'm not concerned about this being a significant factor :-)
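To make the timing argument concrete, here is a toy cycle model (not the StencilFlow simulator; all names and numbers are illustrative). A kernel joining a direct path and a delayed path fires as soon as both FIFOs hold data, so its first firing depends on the path delay, not on the FIFO capacity; over-provisioning only costs memory.

```python
from collections import deque

def simulate(delay_cycles: int, fifo_capacity: int, num_cycles: int) -> int:
    """Toy model: cycle at which a two-input joining kernel first fires."""
    direct, delayed = deque(), deque()
    first_fire = None
    for cycle in range(num_cycles):
        # Producers push one element per cycle if there is space.
        if len(direct) < fifo_capacity:
            direct.append(cycle)
        if cycle >= delay_cycles and len(delayed) < fifo_capacity:
            delayed.append(cycle - delay_cycles)
        # The consumer pops as soon as both inputs have data available.
        if direct and delayed:
            direct.popleft()
            delayed.popleft()
            if first_fire is None:
                first_fire = cycle
    return first_fire

# The first firing cycle depends only on the delay, not on the capacity:
assert simulate(delay_cycles=8, fifo_capacity=16, num_cycles=32) == \
       simulate(delay_cycles=8, fifo_capacity=1024, num_cycles=32)
```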
Compiling reference SDFG...
Loading input arrays...
Initializing output arrays...
Executing DaCe program...
Finished running program.
Executing reference DaCe program...
Finished running program.
Results saved to results/horidiff
Reference results saved to results/horidiff/reference
Comparing to reference SDFG...
Results verified.
bin/run_program.py test/stencils/horidiff.json emulation -compare-to-referenc 47.84s user 11.63s system 214% cpu 27.689 total
It seems to be fixed, please reproduce this finding before we merge :)
This was with a minimum channel depth of 1024.