CUDA/FPGA compilation errors
I'm trying to apply some transformations for GPU/FPGA devices, but I'm getting some errors:
Any ideas how I can fix them?
I have installed the CUDA toolkit via sudo apt install, and my gcc --version is 9, but I don't know where I should point DaCe to use it (I can't find gcc in ~/.dace.conf).
dace.txt
Thanks for the help.
Edit:
After changing:
for index in dace.map[0:size]:
to:
@dace.map(_[0:size])
def fun(index):
FPGA works normally. CUDA still fails.
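For reference, a minimal sketch of the two map syntaxes side by side (array names, dtype, and the symbolic size are made up for illustration); the explicit form is the one that worked for the FPGA backend here:

import dace

size = dace.symbol("size")

# Implicit syntax: a parallel for-loop over the range.
@dace.program
def fun_implicit(A: dace.uint8[size], B: dace.uint8[size]):
    for index in dace.map[0:size]:
        B[index] = A[index] + 1

# Explicit dataflow syntax: the decorated function becomes a tasklet inside a
# map over [0:size]; << and >> declare its input and output memlets.
@dace.program
def fun_explicit(A: dace.uint8[size], B: dace.uint8[size]):
    @dace.map(_[0:size])
    def compute(index):
        a << A[index]
        b >> B[index]
        b = a + 1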
@bartokon Seems like CUDA is failing because its version is too old to support GCC 9 or newer. The official recommendation from NVIDIA is to install the CUDA toolkit only with their repo (rather than the Ubuntu repo, which typically has older versions of CUDA). See here: https://developer.nvidia.com/cuda-downloads
As for the other bug, if you have a small reproducer function, that would be super helpful!
Installing CUDA from the NVIDIA site and rebooting works. I will try to post the code in a few days.
@tbennun auto_opt likes to break things; also try uncommenting the different functions, they should do the same thing but in different ways :) threshold_numpy.zip I can't even run a basic for loop without breaking the program :dagger:, but the parallel version works well on CPU/GPU, just not on FPGA. (There should be an option to manually add GPU/FPGA/CPU-specific pragmas like #pragma HLS partition in Python code)...
Hi @definelicht, is it possible to add an auto_opt that adds, for example, a 4 KB (max AXI burst size) temporary buffer for input data? Using direct pointers to gmem instead of small local buffers is really killing FPGA kernel performance.
Thank you for the detailed example!
Any ideas about the FPGA issues @definelicht @TizianoDeMatteis ?
I have noticed that for:
@dace.map
def calc_mask(pix: _[0:size]):
the generated loop isn't unrolled. Why?
for (int pix = 0; pix < size; pix += 1) {
#pragma HLS PIPELINE II=1
#pragma HLS LOOP_FLATTEN
{/*some code */}
}
This should instead generate:
for (int pix = 0; pix < size; pix += 1) {
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL
#pragma HLS LOOP_FLATTEN
{/*some code */}
}
Now, even if I manually create a "burst_size"-element "pix" packet transfer, these loops are only pipelined, even when the local array is completely partitioned using a remixed version of the global_to_local function:
fpga_auto_opt.fpga_global_to_local(sdfg, max_size=burst_size)
fpga_auto_opt.fpga_local_to_registers(sdfg, max_size=burst_size)
Why global_to_local first?
loc_in_pixels = dace.define_local(shape=(burst_size), dtype=dace.uint8)
It is a global array by default (should it be that way?)
(There should be an option to manually add GPU/FPGA/CPU-specific pragmas like #pragma HLS partition in Python code)...
What is your use case -- which pragma are you lacking? We support fully partitioned memories and automatically partitioned memories.
Hi @definelicht, is it possible to add an auto_opt that adds, for example, a 4 KB (max AXI burst size) temporary buffer for input data? Using direct pointers to gmem instead of small local buffers is really killing FPGA kernel performance.
Xilinx automatically inserts some buffering on their memory-mapped interfaces, so I wouldn't expect this to be necessary unless you are reusing the memory.
An alternative way of doing this is adding buffer space to FIFOs between modules reading from memory and the module doing the computation, which could become part of the streaming transformations.
@TizianoDeMatteis thoughts?
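If it helps as a sketch of that idea: assuming the SDFG already contains a stream container between the memory-reading module and the compute module (the name "pix_stream" and the `sdfg` variable are made up), its FIFO depth can be enlarged through the data descriptor's buffer_size:

import dace

desc = sdfg.arrays["pix_stream"]  # hypothetical stream container in an FPGA SDFG
if isinstance(desc, dace.data.Stream):
    desc.buffer_size = 4096  # depth of the generated FIFO, in elements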
I have noticed that for:
@dace.map
def calc_mask(pix: _[0:size]):
the generated loop isn't unrolled. Why?
I don't think we currently do any automatic unrolling. You can unroll it manually if you wish! @TizianoDeMatteis @alexnick83 we could think about automatically unrolling loops with constant loop indices that only access local memory.
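As a sketch of doing it manually on the generated SDFG (the program name is hypothetical; the map label matches the snippet above):

import dace
from dace.sdfg import nodes

sdfg = some_program.to_sdfg()  # hypothetical @dace.program containing the calc_mask map
for node, _parent in sdfg.all_nodes_recursive():
    if isinstance(node, nodes.MapEntry) and node.map.label == "calc_mask":
        node.map.unroll = True  # emit this loop fully unrolled in the generated code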
loc_in_pixels = dace.define_local(shape=(burst_size), dtype=dace.uint8)
It is a global array by default (should it be that way?)
Yes, all arrays are global by default, but can easily be changed to be local memories.
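For example (a sketch reusing the array name from above), switching a transient from global memory to on-chip memory or registers is a one-line change on the data descriptor:

import dace

# Assuming `sdfg` already contains a transient named "loc_in_pixels":
sdfg.arrays["loc_in_pixels"].storage = dace.StorageType.FPGA_Local      # on-chip BRAM/URAM
# or, for a fully partitioned memory:
sdfg.arrays["loc_in_pixels"].storage = dace.StorageType.FPGA_Registers  # registers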
Thanks for the info!
Yes, I have noticed that Xilinx creates extra buffers when using bursts. But without a minimal local array there will be no burst; maybe add some micro-buffering between memlets, like the streaming transformations you mentioned?
About DaCe maps:
Maps (parallel for-loops) can be created with dace.map.
Source: https://github.com/spcl/dace/blob/master/tutorials/numpy_frontend.ipynb
If I wanted a pipelined loop without unrolling, I would use a normal for i in range(size): loop.
After two days of fun with DaCe, what I can see is that CPU and GPU work wonders, but the FPGA coding style is so pragma- and buffer-dependent that ultimately it is just always slower.
I would like to add some specific pragmas inside a loop, maybe via extra commands that would be inserted into a @dace.program, like:
@dace.program
def fun():
    # ...some code...
    loc_in_pixels = dace.define_local(shape=(burst_size), dtype=dace.uint8, memtype=dace.local)  # This creates a local buffer, not a global one!
    dace.hint(loc_in_pixels, "#pragma HLS ARRAY_PARTITION variable=loc_in_pixels complete")  # Place that "text" next to the loc_in_pixels variable definition.
    for i in range(10):
        dace.hint("#pragma HLS UNROLL")  # Put the hint right here in the tasklet. Maybe add something more specific, like:
        dace.hint(var=something, text_to_add="some_text", device=dace.device_type.CPU)
@dace.program
def fun():
    # ...some code...
    loc_in_pixels = dace.define_local(shape=(burst_size), dtype=dace.uint8)  # This creates some local buffer
    dace.hint(loc_in_pixels, dace.device_type.GPU)  # If CPU is used, there will be no loc_in_pixels array and all memlets using that array would be optimized away (just skip this buffer and don't implement it). You should already have something similar in the transformations and map merging.
Yes, I have noticed that Xilinx creates extra buffers when using bursts. But without a minimal local array there will be no burst; maybe add some micro-buffering between memlets, like the streaming transformations you mentioned?
Xilinx detects accesses to adjacent indices in consecutive loop iterations and infers burst accesses. Local buffers are not required. For example:
void Foo(int const *from_dram, hlslib::Stream<int> &s, int n) {
  // Consecutive addresses across iterations let HLS infer a burst access.
  for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
s.Push(from_dram[i]);
}
}
This will infer bursts of size n, even though the data is just being written to a stream.
I would like to add some specific pragmas inside a loop, maybe via extra commands that would be inserted into a @dace.program, like:
@dace.program
def fun():
    # ...some code...
    loc_in_pixels = dace.define_local(shape=(burst_size), dtype=dace.uint8, memtype=dace.local)  # This creates a local buffer, not a global one!
    dace.hint(loc_in_pixels, "#pragma HLS ARRAY_PARTITION variable=loc_in_pixels complete")  # Place that "text" next to the loc_in_pixels variable definition.
You can achieve complete partitioning of a variable by setting its storage type to FPGA_Registers (for example, sdfg.arrays["loc_in_pixels"].storage = dace.StorageType.FPGA_Registers).
for i in range(10):
    dace.hint("#pragma HLS UNROLL")  # Put the hint right here in the tasklet. Maybe add something more specific, like:
We support map unrolling by simply setting unroll=True on the map object. This might also be supported in the map exposed in the frontend?
An alternative way of doing this is adding buffer space to FIFOs between modules reading from memory and the module doing the computation, which could become part of the streaming transformations.
@definelicht maybe we can add it as a transformation option/argument?
@bartokon I don't know if you already checked this, but regarding defining local memories, we also have an auto-opt transformation (fpga_global_to_local inside dace/transformation/auto/fpga.py) that automatically transforms global memories into local ones (if certain conditions apply).
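A minimal usage sketch (assuming it can be called directly on the SDFG; the program name is hypothetical):

from dace.transformation.auto import fpga as fpga_auto_opt

sdfg = fun.to_sdfg()  # hypothetical @dace.program
fpga_auto_opt.fpga_global_to_local(sdfg)  # converts qualifying global transients to FPGA-local storage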
Actually, I realized that we already have the minimum_fifo_depth configuration parameter for Xilinx kernels :-)
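For reference, a sketch of setting it programmatically (assuming the key sits under the Xilinx compiler section; it can also be set in ~/.dace.conf or via the corresponding DACE_* environment variable):

import dace

dace.Config.set("compiler", "xilinx", "minimum_fifo_depth", value=32)  # hypothetical key path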