XRT icon indicating copy to clipboard operation
XRT copied to clipboard

simple changes to operation order causes XRT to return incorrect result

Open Ralender opened this issue 3 years ago • 7 comments

The provided test case is mostly automatically generated from an issue found in our runtime build on top of XRT.

depending on when the xrt::bo is created XRT will either work as expected or return incorrect data.

steps to reproduce:

> export XCL_EMULATION_MODE=hw_emu
> unzip reprod.zip
> clang++ -o reprod host.cpp -g -I/opt/xilinx/xrt/include -L/opt/xilinx/xrt/lib -lOpenCL -luuid -lxrt_coreutil
> ./reprod
will work as expected
> clang++ -o reprod host.cpp -g -I/opt/xilinx/xrt/include -L/opt/xilinx/xrt/lib -lOpenCL -luuid -lxrt_coreutil -DFAIL
> ./reprod
will hit an assert because of incorrect outputs

I think this is a bug.

here is the host.cpp and the device.xclbin reprod.zip

the program is simply writing data given as scalar into a buffer. here is the host code also in reprod.zip:

#include <xrt/xrt_kernel.h>
#include <xrt.h>
#include <array>

int main() {

// from: _pi_platform::_pi_platform(unsigned int) sycl/plugins/xrt/pi_xrt.cpp:476
// call str:xrt::device(i)
auto name0 = xrt::device(((unsigned int)0));

// xclbin buffer size=20927220
// from: pi_result xrt_piProgramCreateWithBinary(pi_context, pi_uint32, const pi_device *, const size_t *, const unsigned char **, size_t, const pi_device_binary_property *, pi_int32 *, pi_program *) sycl/plugins/xrt/pi_xrt.cpp:1846
// call str:xrt::xclbin(reinterpret_cast<const axlf *>(binaries[0]))
auto name1 = xrt::xclbin("device.xclbin");

/// When we do buffer construction amd mapping here the data written by the kernel is incorrect
#ifdef FAIL
// from: void _pi_mem::map_if_needed(const xrt::device &) sycl/plugins/xrt/pi_xrt.cpp:555
// call str:xrt::bo(device, mem.size, XRT_BO_FLAGS_NONE, 0)
auto name6 = xrt::bo(name0, ((unsigned long)16), ((int)0), ((int)0));

// from: void _pi_mem::map_if_needed(const xrt::device &) sycl/plugins/xrt/pi_xrt.cpp:556
// call str:buffer_.map()
auto name7 = name6.map();
#endif

// from: pi_result xrt_piProgramCreateWithBinary(pi_context, pi_uint32, const pi_device *, const size_t *, const unsigned char **, size_t, const pi_device_binary_property *, pi_int32 *, pi_program *) sycl/plugins/xrt/pi_xrt.cpp:1847
// call str:dev->get().load_xclbin(xclbin)
auto name2 = name0.load_xclbin(name1);

/// BUT When we do buffer construction amd mapping here it works as expected
#ifndef FAIL
// from: void _pi_mem::map_if_needed(const xrt::device &) sycl/plugins/xrt/pi_xrt.cpp:555
// call str:xrt::bo(device, mem.size, XRT_BO_FLAGS_NONE, 0)
auto name6 = xrt::bo(name0, ((unsigned long)16), ((int)0), ((int)0));

// from: void _pi_mem::map_if_needed(const xrt::device &) sycl/plugins/xrt/pi_xrt.cpp:556
// call str:buffer_.map()
auto name7 = name6.map();
#endif

// from: pi_result xrt_piKernelCreate(pi_program, const char *, pi_kernel *) sycl/plugins/xrt/pi_xrt.cpp:1699
// call str:xrt::kernel(program->device_->get(), program->uuid_, kernel_name)
auto name3 = xrt::kernel(name0, name2, "dlerEE_clES2_E6Kernel_0i996ruy");

// from: pi_result xrt_piKernelCreate(pi_program, const char *, pi_kernel *) sycl/plugins/xrt/pi_xrt.cpp:1700
// call str:program->bin_.get_kernel(kernel_name)
auto name4 = name1.get_kernel("dlerEE_clES2_E6Kernel_0i996ruy");

// from: pi_result xrt_piKernelCreate(pi_program, const char *, pi_kernel *) sycl/plugins/xrt/pi_xrt.cpp:1701
// call str:xrt::run(ker)
auto name5 = xrt::run(name3);

// from: pi_result xrt_piextKernelSetArgMemObj(pi_kernel, pi_uint32, const pi_mem *) sycl/plugins/xrt/pi_xrt.cpp:1726
// call str:kernel->run_.set_arg(arg_index, buf->buffer_)
name5.set_arg(((unsigned int)0), name6);

// from: pi_result xrt_piKernelSetArg(pi_kernel, pi_uint32, size_t, const void *) sycl/plugins/xrt/pi_xrt.cpp:1714
std::array<char, 8> name8= {'\x04', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00'};

// from: pi_result xrt_piKernelSetArg(pi_kernel, pi_uint32, size_t, const void *) sycl/plugins/xrt/pi_xrt.cpp:1715
// call str:kernel->run_.set_arg(arg_index, arg_value, arg_size)
name5.set_arg(((unsigned int)1), ((const void*)name8.data()), ((unsigned long)8));

// from: pi_result xrt_piKernelSetArg(pi_kernel, pi_uint32, size_t, const void *) sycl/plugins/xrt/pi_xrt.cpp:1714
std::array<char, 8> name9= {'\x04', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00'};

// from: pi_result xrt_piKernelSetArg(pi_kernel, pi_uint32, size_t, const void *) sycl/plugins/xrt/pi_xrt.cpp:1715
// call str:kernel->run_.set_arg(arg_index, arg_value, arg_size)
name5.set_arg(((unsigned int)2), ((const void*)name9.data()), ((unsigned long)8));

// from: pi_result xrt_piKernelSetArg(pi_kernel, pi_uint32, size_t, const void *) sycl/plugins/xrt/pi_xrt.cpp:1714
std::array<char, 8> name10= {'\x00', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00'};

// from: pi_result xrt_piKernelSetArg(pi_kernel, pi_uint32, size_t, const void *) sycl/plugins/xrt/pi_xrt.cpp:1715
// call str:kernel->run_.set_arg(arg_index, arg_value, arg_size)
name5.set_arg(((unsigned int)3), ((const void*)name10.data()), ((unsigned long)8));

// from: pi_result xrt_piKernelSetArg(pi_kernel, pi_uint32, size_t, const void *) sycl/plugins/xrt/pi_xrt.cpp:1714
std::array<char, 16> name11= {'\x00', '\x00', '\x00', '\x00', '\x01', '\x00', '\x00', '\x00', '\x02', '\x00', '\x00', '\x00', '\x03', '\x00', '\x00', '\x00'};

// from: pi_result xrt_piKernelSetArg(pi_kernel, pi_uint32, size_t, const void *) sycl/plugins/xrt/pi_xrt.cpp:1715
// call str:kernel->run_.set_arg(arg_index, arg_value, arg_size)
name5.set_arg(((unsigned int)4), ((const void*)name11.data()), ((unsigned long)16));

// from: pi_result xrt_piEnqueueKernelLaunch(pi_queue, pi_kernel, pi_uint32, const size_t *, const size_t *, const size_t *, pi_uint32, const pi_event *, pi_event *) sycl/plugins/xrt/pi_xrt.cpp:1754
// call str:kernel->run_.start()
name5.start();

// from: pi_result xrt_piEnqueueKernelLaunch(pi_queue, pi_kernel, pi_uint32, const size_t *, const size_t *, const size_t *, pi_uint32, const pi_event *, pi_event *) sycl/plugins/xrt/pi_xrt.cpp:1755
// call str:kernel->run_.wait()
name5.wait();

// from: pi_result xrt_piEnqueueMemBufferRead(pi_queue, pi_mem, pi_bool, size_t, size_t, void *, pi_uint32, const pi_event *, pi_event *) sycl/plugins/xrt/pi_xrt.cpp:1679
// call str:buffer->buffer_.sync(XCL_BO_SYNC_BO_FROM_DEVICE)
name6.sync(((xclBOSyncDirection)1));

// from: pi_result xrt_piEnqueueMemBufferRead(pi_queue, pi_mem, pi_bool, size_t, size_t, void *, pi_uint32, const pi_event *, pi_event *) sycl/plugins/xrt/pi_xrt.cpp:1681
std::array<char, 16> name12= {'\xc0', '\xdc', '\x3a', '\xf7', '\xff', '\x7f', '\x00', '\x00', '\xc0', '\xdc', '\x3a', '\xf7', '\xff', '\x7f', '\x00', '\x00'};

// from: pi_result xrt_piEnqueueMemBufferRead(pi_queue, pi_mem, pi_bool, size_t, size_t, void *, pi_uint32, const pi_event *, pi_event *) sycl/plugins/xrt/pi_xrt.cpp:1682
// call str:(void)std::memcpy(ptr, adjusted_ptr, size)
(void)std::memcpy(((void*)name12.data()), name7, ((unsigned long)16));

int *a_c = ((int *)name12.data());
for (int i = 0; i < 4; i++) {
int res = i;
int val = a_c[i];
assert(val == res);
}
}

Ralender avatar Apr 08 '22 18:04 Ralender

I have tried:

-<%>- time ./reprod                                                                                               
WARNING: XCLBIN used is generated with Vivado version 2022.1.0 where as it is run with the Vivado version 2021.2 which is not compatible. May result to weird behaviour.
INFO: [HW-EMU 07-0] Please refer the path "/tmp/.run/1737097/hw_em/device0/binary_0/behav_waveform/xsim/simulate.log" for more detailed simulation infos, errors and warnings.
INFO: [Common 17-206] Exiting xsim at Tue Apr 12 15:07:28 2022...
SIMULATION EXITED
[2]    1737097 segmentation fault (core dumped)  ./reprod
./reprod  0.59s user 0.08s system 0% cpu 5:00.12 total

keryell avatar Apr 12 '22 22:04 keryell

Is it a problem you have also on real hardware execution or only on hardware emulation?

keryell avatar Apr 12 '22 22:04 keryell

@Ralender . Thanks. This is emulation and I am not quite sure what assumptions emulation makes in regards to xclbin loaded or not prior to buffer allocation. I suspect the problem is the loading of the xclbin. If you move FAIL immediately after loading of xclbin, does it still fail?

Still of course, if xclbin is required to be loaded, then your FAIL allocation should truly have failed with bad_alloc. That would be the case in HW where the current MEM_TOPOLOGY known by driver must at a minimum match the bank you request in BO allocation.

But even in HW, it is not clear to me what happens with an already allocated BO in case the xclbin is swapped out with another one? @maxzhen ? If the mem topology of the second xclbin matches that of the former presumably it should work, but if not, then what happens.?

@Ralender can you try the FAIL after xclbin and annotate the issue. It could be emulation related and I will assign the issue accordingly.

stsoe avatar Apr 13 '22 20:04 stsoe

@Ralender . Thanks. This is emulation and I am not quite sure what assumptions emulation makes in regards to xclbin loaded or not prior to buffer allocation. I suspect the problem is the loading of the xclbin. If you move FAIL immediately after loading of xclbin, does it still fail?

yes, I updated the cpp and zip.

Still of course, if xclbin is required to be loaded, then your FAIL allocation should truly have failed with bad_alloc. That would be the case in HW where the current MEM_TOPOLOGY known by driver must at a minimum match the bank you request in BO allocation.

the banks should match the banks expected by the xclbin. everything is at 0.

@Ralender can you try the FAIL after xclbin and annotate the issue. It could be emulation related and I will assign the issue accordingly.

I am not sure what to annotate it as. I am compiling for hw to know if it is emulation related.

Ralender avatar Apr 13 '22 21:04 Ralender

both creating the bo before and after loading the xclbin work on hw. so it is an emulation issue

Ralender avatar Apr 13 '22 23:04 Ralender

Hi @Ralender, could you please file a CR with the test case and steps to reproduce. I would like to assign it to my team member for verification of the issue and investigation.

I have tried to run it locally. Seems I need to clang++ to follow the Steps to Reproduce for this. Can we use g++ for compiling the host code instead of clang++ for verifying why it is crashing in XRT etc.

venkatp-xilinx avatar Apr 19 '22 06:04 venkatp-xilinx

@venkatp-xilinx Yes I think that g++ will work too.

keryell avatar Sep 13 '22 20:09 keryell