
CUDA/FPGA compilation errors

Open bartokon opened this issue 3 years ago • 11 comments

I'm trying to apply some transformations for GPU/FPGA device but I'm getting some errors:

[screenshots of the compilation errors]

Any ideas how I can fix it? I have installed the CUDA toolkit via sudo apt install, and my gcc --version is 9 [screenshot]. But I don't know where I should point DaCe to it (I can't find a gcc entry in ~/.dace.conf). Config attached: dace.txt

Thanks for help.

Edit: After changing "for index in dace.map[0:size]:" to "@dace.map(_[0:size]) def fun(index):", FPGA is working normally. CUDA still fails.
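For reference, a minimal sketch of the two map syntaxes in question (array names and the +1 computation are placeholders, not my actual code):

import dace

size = dace.symbol('size')

@dace.program
def loop_style(A: dace.float32[size], B: dace.float32[size]):
    # Implicit syntax: a map written as a parallel for-loop
    for index in dace.map[0:size]:
        B[index] = A[index] + 1

@dace.program
def decorator_style(A: dace.float32[size], B: dace.float32[size]):
    # Explicit syntax: a map as a decorated function with memlet annotations
    @dace.map(_[0:size])
    def fun(index):
        a << A[index]
        b >> B[index]
        b = a + 1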

bartokon avatar Dec 25 '21 21:12 bartokon

@bartokon Seems like CUDA is failing because its version is too old to support GCC 9 or newer. The official recommendation from NVIDIA is to install the CUDA toolkit only from their repository (rather than the Ubuntu repo, which typically ships older CUDA versions). See here: https://developer.nvidia.com/cuda-downloads

As for the other bug, if you have a small reproducer function, that would be super helpful!

tbennun avatar Dec 25 '21 22:12 tbennun

Installing CUDA from the NVIDIA site and rebooting works. I will try to post the code in a few days.

bartokon avatar Dec 26 '21 12:12 bartokon

@tbennun Auto_opt likes to break things. Try uncommenting the different functions in the attached threshold_numpy.zip; they should do the same thing, just written differently :) I can't even run a basic for loop without breaking the program :dagger:, but parallel maps work well on CPU/GPU, just not on FPGA. (There should be an option to manually add GPU/FPGA/CPU-specific pragmas like #pragma HLS ARRAY_PARTITION in Python code)...

Hi @definelicht, is it possible to add an Auto_opt pass that adds, for example, a 4 KB (max AXI burst size) temporary buffer for input data? Using direct pointers to gmem instead of small local buffers is really killing FPGA kernel performance.
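Something like the following sketch is what I mean (names, the 1-D uint8 signature, and the burst_size constant are just placeholders, not the requested Auto_opt pass):

import dace

N = dace.symbol('N')
burst_size = 4096  # placeholder: one max-size AXI burst of uint8 elements

@dace.program
def staged_copy(gmem_in: dace.uint8[N], gmem_out: dace.uint8[N]):
    # Stage each burst-sized chunk in a small on-chip buffer instead of
    # accessing global memory element by element
    for t in range(0, N, burst_size):
        local_buf = dace.define_local((burst_size,), dace.uint8)
        local_buf[:] = gmem_in[t:t + burst_size]
        gmem_out[t:t + burst_size] = local_buf[:]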

bartokon avatar Dec 27 '21 19:12 bartokon

Thank you for the detailed example!

Any ideas about the FPGA issues @definelicht @TizianoDeMatteis ?

tbennun avatar Dec 27 '21 22:12 tbennun

I have noticed that the loop generated for @dace.map def calc_mask(pix: _[0:size]): isn't unrolled. Why is that?

for (int pix = 0; pix < size; pix += 1) {
    #pragma HLS PIPELINE II=1
    #pragma HLS LOOP_FLATTEN
    { /* some code */ }
}

This should instead generate:

for (int pix = 0; pix < size; pix += 1) {
    #pragma HLS PIPELINE II=1
    #pragma HLS UNROLL
    #pragma HLS LOOP_FLATTEN
    { /* some code */ }
}

Now, even if I manually create a burst_size "pix" packet transfer, these loops are only pipelined, even when the local array is completely partitioned using the remixed global_to_local functions:

fpga_auto_opt.fpga_global_to_local(sdfg, max_size=burst_size)
fpga_auto_opt.fpga_local_to_registers(sdfg, max_size=burst_size)

Why global_to_local first? Also: loc_in_pixels = dace.define_local(shape=(burst_size,), dtype=dace.uint8) is a global array by default (should it be that way?)

bartokon avatar Dec 27 '21 23:12 bartokon

(There should be an option to manually add GPU/FPGA/CPU-specific pragmas like #pragma HLS ARRAY_PARTITION in Python code)...

What is your use case -- which pragma are you lacking? We support fully partitioned memories, and automatically partitioned memories.

Hi @definelicht, is it possible to add an Auto_opt pass that adds, for example, a 4 KB (max AXI burst size) temporary buffer for input data? Using direct pointers to gmem instead of small local buffers is really killing FPGA kernel performance.

Xilinx automatically inserts some buffering on their memory-mapped interfaces, so I wouldn't expect this to be necessary unless you are reusing the memory.

An alternative way of doing this is adding buffer space to FIFOs between modules reading from memory and the module doing the computation, which could become part of the streaming transformations.

@TizianoDeMatteis thoughts?

definelicht avatar Dec 28 '21 11:12 definelicht

I have noticed that the loop generated for @dace.map def calc_mask(pix: _[0:size]): isn't unrolled. Why is that?

I don't think we currently do any automatic unrolling. You can unroll it manually if you wish! @TizianoDeMatteis @alexnick83 we could think about automatically unrolling loops with constant loop indices that only access local memory.

loc_in_pixels = dace.define_local(shape=(burst_size,), dtype=dace.uint8) is a global array by default (should it be that way?)

Yes, all arrays are global by default, but can easily be changed to be local memories.
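For example, a minimal sketch on the parsed SDFG (using the fun and loc_in_pixels names from your snippet):

import dace

sdfg = fun.to_sdfg()
# Move the container from global (off-chip) memory to on-chip BRAM
sdfg.arrays["loc_in_pixels"].storage = dace.StorageType.FPGA_Local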

definelicht avatar Dec 28 '21 11:12 definelicht

Thanks for the info!

Yes, I have noticed that Xilinx creates extra buffers when using bursts. But without minimal local arrays there will be no burst; maybe add some micro-buffering between memlets, like the streaming transformations you mentioned?

About DaCe maps: "Maps (parallel for-loops) can be created with dace.map" (source: https://github.com/spcl/dace/blob/master/tutorials/numpy_frontend.ipynb). If I wanted a pipelined loop without unrolling, I would use a normal for i in range(size): loop.

After two days of fun with DaCe, what I can see is that CPU and GPU work wonders, but FPGA coding style is so pragma- and buffer-dependent that the result is ultimately always slower.

I would like to add specific pragmas inside loops, maybe via some extra commands inserted into a @dace.program, like:

@dace.program
def fun():
    # some code
    loc_in_pixels = dace.define_local(shape=(burst_size,), dtype=dace.uint8, memtype=dace.local)  # This creates a local buffer, not a global one!
    dace.hint(loc_in_pixels, "#pragma HLS ARRAY_PARTITION variable=loc_in_pixels complete")  # Place that "text" near the loc_in_pixels variable definition.
    for i in range(10):
        dace.hint("#pragma HLS UNROLL")  # Put the hint right here in the tasklet. Maybe add something more specific, like:
        # dace.hint(var=something, text_to_add="some_text", device=dace.device_type.CPU)

@dace.program
def fun():
    # some code
    loc_in_pixels = dace.define_local(shape=(burst_size,), dtype=dace.uint8)  # This creates a local buffer
    dace.hint(loc_in_pixels, dace.device_type.GPU)  # If CPU is used, there will be no loc_in_pixels array, and all memlets using it would be optimized away (just skip this buffer and don't implement it). You should already have something similar in the transformations and map merging.

bartokon avatar Dec 28 '21 12:12 bartokon

Yes, I have noticed that Xilinx creates extra buffers when using bursts. But without minimal local arrays there will be no burst; maybe add some micro-buffering between memlets, like the streaming transformations you mentioned?

Xilinx detects accesses to adjacent indices in consecutive loop iterations and infers burst accesses. Local buffers are not required. For example:

#include <hlslib/xilinx/Stream.h>

void Foo(int const *from_dram, hlslib::Stream<int> &s, int n) {
  for (int i = 0; i < n; ++i) {
    #pragma HLS PIPELINE II=1
    // Consecutive iterations read adjacent indices, so HLS infers a burst
    s.Push(from_dram[i]);
  }
}

This will infer bursts of size n, even though it's just being written to a stream.

I would like to add specific pragmas inside loops, maybe via some extra commands inserted into a @dace.program, like:

@dace.program
def fun():
    # some code
    loc_in_pixels = dace.define_local(shape=(burst_size,), dtype=dace.uint8, memtype=dace.local)  # This creates a local buffer, not a global one!
    dace.hint(loc_in_pixels, "#pragma HLS ARRAY_PARTITION variable=loc_in_pixels complete")  # Place that "text" near the loc_in_pixels variable definition.

You can achieve complete partitioning of a variable by setting its storage type to FPGA_Registers (for example, sdfg.arrays["loc_in_pixels"].storage = dace.StorageType.FPGA_Registers).

for i in range(10): dace.hint("#pragma HLS UNROLL") # Put the hint right here in the tasklet. Maybe add something more specific, like:

We support map unrolling by simply setting unroll=True on the map object. This might also be supported in the map exposed in the frontend?
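For example, a rough sketch on a parsed SDFG (using the fun name from your snippet):

import dace
from dace.sdfg import nodes

sdfg = fun.to_sdfg()
for node, _ in sdfg.all_nodes_recursive():
    if isinstance(node, nodes.MapEntry):
        node.map.unroll = True  # the FPGA backend then emits the loop fully unrolled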

definelicht avatar Dec 30 '21 15:12 definelicht

An alternative way of doing this is adding buffer space to FIFOs between modules reading from memory and the module doing the computation, which could become part of the streaming transformations.

@definelicht maybe we can add it as a transformation option/argument?

@bartokon I don't know if you have already checked this, but regarding defining local memories, we also have an auto-opt transformation (fpga_global_to_local inside dace/transformation/auto/fpga.py) that automatically transforms global memories into local ones (if certain conditions apply).
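A rough usage sketch (the fun name is from the earlier snippets; the max_size argument appears in your snippet above but may differ from the upstream signature):

from dace.transformation.auto import fpga as fpga_auto_opt

sdfg = fun.to_sdfg()
# Promote qualifying global (off-chip) transients to FPGA-local storage
fpga_auto_opt.fpga_global_to_local(sdfg)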

TizianoDeMatteis avatar Jan 03 '22 07:01 TizianoDeMatteis

Actually, I realized that we already have the minimum_fifo_depth configuration parameter for Xilinx kernels :-)
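A rough sketch of setting it (assuming the key lives under the compiler.xilinx config section; the exact path and value are assumptions):

import dace

# Assumed config path: compiler -> xilinx -> minimum_fifo_depth
dace.Config.set('compiler', 'xilinx', 'minimum_fifo_depth', value='32')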

definelicht avatar Jan 03 '22 09:01 definelicht