Simple changes to operation order cause XRT to return incorrect results
The provided test case is mostly automatically generated from an issue found in our runtime, which is built on top of XRT.
Depending on when the xrt::bo is created, XRT will either work as expected or return incorrect data.
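The essence of the difference, condensed from the full reproducer below (a simplified, illustrative sketch; names and helper functions are not part of the test case):
#include <xrt/xrt_kernel.h>
#include <xrt.h>
// Failing order (what -DFAIL selects): the buffer is created and mapped
// before the xclbin is loaded onto the device.
void failing_order() {
  auto dev = xrt::device(0u);
  auto bin = xrt::xclbin("device.xclbin");
  auto bo  = xrt::bo(dev, 16, /*flags=*/0, /*bank=*/0); // created before load_xclbin
  auto ptr = bo.map();   // in the full test, kernel output read back through this mapping is incorrect
  dev.load_xclbin(bin);
}
// Passing order (default build): the same buffer is created and mapped
// only after load_xclbin has completed.
void passing_order() {
  auto dev = xrt::device(0u);
  auto bin = xrt::xclbin("device.xclbin");
  dev.load_xclbin(bin);
  auto bo  = xrt::bo(dev, 16, /*flags=*/0, /*bank=*/0);
  auto ptr = bo.map();   // in the full test, kernel output read back through this mapping is correct
}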
Steps to reproduce:
> export XCL_EMULATION_MODE=hw_emu
> unzip reprod.zip
> clang++ -o reprod host.cpp -g -I/opt/xilinx/xrt/include -L/opt/xilinx/xrt/lib -lOpenCL -luuid -lxrt_coreutil
> ./reprod
will work as expected
> clang++ -o reprod host.cpp -g -I/opt/xilinx/xrt/include -L/opt/xilinx/xrt/lib -lOpenCL -luuid -lxrt_coreutil -DFAIL
> ./reprod
will hit an assert because of incorrect outputs
I think this is a bug.
Here are the host.cpp and the device.xclbin: reprod.zip
The program simply writes data given as a scalar into a buffer. Here is the host code (also in reprod.zip):
#include <xrt/xrt_kernel.h>
#include <xrt.h>
#include <array>
#include <cassert>
#include <cstring>
int main() {
// from: _pi_platform::_pi_platform(unsigned int) sycl/plugins/xrt/pi_xrt.cpp:476
// call str:xrt::device(i)
auto name0 = xrt::device(((unsigned int)0));
// xclbin buffer size=20927220
// from: pi_result xrt_piProgramCreateWithBinary(pi_context, pi_uint32, const pi_device *, const size_t *, const unsigned char **, size_t, const pi_device_binary_property *, pi_int32 *, pi_program *) sycl/plugins/xrt/pi_xrt.cpp:1846
// call str:xrt::xclbin(reinterpret_cast<const axlf *>(binaries[0]))
auto name1 = xrt::xclbin("device.xclbin");
/// When we do buffer construction and mapping here, the data written by the kernel is incorrect
#ifdef FAIL
// from: void _pi_mem::map_if_needed(const xrt::device &) sycl/plugins/xrt/pi_xrt.cpp:555
// call str:xrt::bo(device, mem.size, XRT_BO_FLAGS_NONE, 0)
auto name6 = xrt::bo(name0, ((unsigned long)16), ((int)0), ((int)0));
// from: void _pi_mem::map_if_needed(const xrt::device &) sycl/plugins/xrt/pi_xrt.cpp:556
// call str:buffer_.map()
auto name7 = name6.map();
#endif
// from: pi_result xrt_piProgramCreateWithBinary(pi_context, pi_uint32, const pi_device *, const size_t *, const unsigned char **, size_t, const pi_device_binary_property *, pi_int32 *, pi_program *) sycl/plugins/xrt/pi_xrt.cpp:1847
// call str:dev->get().load_xclbin(xclbin)
auto name2 = name0.load_xclbin(name1);
/// BUT when we do buffer construction and mapping here, it works as expected
#ifndef FAIL
// from: void _pi_mem::map_if_needed(const xrt::device &) sycl/plugins/xrt/pi_xrt.cpp:555
// call str:xrt::bo(device, mem.size, XRT_BO_FLAGS_NONE, 0)
auto name6 = xrt::bo(name0, ((unsigned long)16), ((int)0), ((int)0));
// from: void _pi_mem::map_if_needed(const xrt::device &) sycl/plugins/xrt/pi_xrt.cpp:556
// call str:buffer_.map()
auto name7 = name6.map();
#endif
// from: pi_result xrt_piKernelCreate(pi_program, const char *, pi_kernel *) sycl/plugins/xrt/pi_xrt.cpp:1699
// call str:xrt::kernel(program->device_->get(), program->uuid_, kernel_name)
auto name3 = xrt::kernel(name0, name2, "dlerEE_clES2_E6Kernel_0i996ruy");
// from: pi_result xrt_piKernelCreate(pi_program, const char *, pi_kernel *) sycl/plugins/xrt/pi_xrt.cpp:1700
// call str:program->bin_.get_kernel(kernel_name)
auto name4 = name1.get_kernel("dlerEE_clES2_E6Kernel_0i996ruy");
// from: pi_result xrt_piKernelCreate(pi_program, const char *, pi_kernel *) sycl/plugins/xrt/pi_xrt.cpp:1701
// call str:xrt::run(ker)
auto name5 = xrt::run(name3);
// from: pi_result xrt_piextKernelSetArgMemObj(pi_kernel, pi_uint32, const pi_mem *) sycl/plugins/xrt/pi_xrt.cpp:1726
// call str:kernel->run_.set_arg(arg_index, buf->buffer_)
name5.set_arg(((unsigned int)0), name6);
// from: pi_result xrt_piKernelSetArg(pi_kernel, pi_uint32, size_t, const void *) sycl/plugins/xrt/pi_xrt.cpp:1714
std::array<char, 8> name8= {'\x04', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00'};
// from: pi_result xrt_piKernelSetArg(pi_kernel, pi_uint32, size_t, const void *) sycl/plugins/xrt/pi_xrt.cpp:1715
// call str:kernel->run_.set_arg(arg_index, arg_value, arg_size)
name5.set_arg(((unsigned int)1), ((const void*)name8.data()), ((unsigned long)8));
// from: pi_result xrt_piKernelSetArg(pi_kernel, pi_uint32, size_t, const void *) sycl/plugins/xrt/pi_xrt.cpp:1714
std::array<char, 8> name9= {'\x04', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00'};
// from: pi_result xrt_piKernelSetArg(pi_kernel, pi_uint32, size_t, const void *) sycl/plugins/xrt/pi_xrt.cpp:1715
// call str:kernel->run_.set_arg(arg_index, arg_value, arg_size)
name5.set_arg(((unsigned int)2), ((const void*)name9.data()), ((unsigned long)8));
// from: pi_result xrt_piKernelSetArg(pi_kernel, pi_uint32, size_t, const void *) sycl/plugins/xrt/pi_xrt.cpp:1714
std::array<char, 8> name10= {'\x00', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00', '\x00'};
// from: pi_result xrt_piKernelSetArg(pi_kernel, pi_uint32, size_t, const void *) sycl/plugins/xrt/pi_xrt.cpp:1715
// call str:kernel->run_.set_arg(arg_index, arg_value, arg_size)
name5.set_arg(((unsigned int)3), ((const void*)name10.data()), ((unsigned long)8));
// from: pi_result xrt_piKernelSetArg(pi_kernel, pi_uint32, size_t, const void *) sycl/plugins/xrt/pi_xrt.cpp:1714
std::array<char, 16> name11= {'\x00', '\x00', '\x00', '\x00', '\x01', '\x00', '\x00', '\x00', '\x02', '\x00', '\x00', '\x00', '\x03', '\x00', '\x00', '\x00'};
// from: pi_result xrt_piKernelSetArg(pi_kernel, pi_uint32, size_t, const void *) sycl/plugins/xrt/pi_xrt.cpp:1715
// call str:kernel->run_.set_arg(arg_index, arg_value, arg_size)
name5.set_arg(((unsigned int)4), ((const void*)name11.data()), ((unsigned long)16));
// from: pi_result xrt_piEnqueueKernelLaunch(pi_queue, pi_kernel, pi_uint32, const size_t *, const size_t *, const size_t *, pi_uint32, const pi_event *, pi_event *) sycl/plugins/xrt/pi_xrt.cpp:1754
// call str:kernel->run_.start()
name5.start();
// from: pi_result xrt_piEnqueueKernelLaunch(pi_queue, pi_kernel, pi_uint32, const size_t *, const size_t *, const size_t *, pi_uint32, const pi_event *, pi_event *) sycl/plugins/xrt/pi_xrt.cpp:1755
// call str:kernel->run_.wait()
name5.wait();
// from: pi_result xrt_piEnqueueMemBufferRead(pi_queue, pi_mem, pi_bool, size_t, size_t, void *, pi_uint32, const pi_event *, pi_event *) sycl/plugins/xrt/pi_xrt.cpp:1679
// call str:buffer->buffer_.sync(XCL_BO_SYNC_BO_FROM_DEVICE)
name6.sync(((xclBOSyncDirection)1));
// from: pi_result xrt_piEnqueueMemBufferRead(pi_queue, pi_mem, pi_bool, size_t, size_t, void *, pi_uint32, const pi_event *, pi_event *) sycl/plugins/xrt/pi_xrt.cpp:1681
std::array<char, 16> name12= {'\xc0', '\xdc', '\x3a', '\xf7', '\xff', '\x7f', '\x00', '\x00', '\xc0', '\xdc', '\x3a', '\xf7', '\xff', '\x7f', '\x00', '\x00'};
// from: pi_result xrt_piEnqueueMemBufferRead(pi_queue, pi_mem, pi_bool, size_t, size_t, void *, pi_uint32, const pi_event *, pi_event *) sycl/plugins/xrt/pi_xrt.cpp:1682
// call str:(void)std::memcpy(ptr, adjusted_ptr, size)
(void)std::memcpy(((void*)name12.data()), name7, ((unsigned long)16));
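// The kernel is expected to have written the ints 0, 1, 2, 3 into the buffer; verify each element.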
int *a_c = ((int *)name12.data());
for (int i = 0; i < 4; i++) {
int res = i;
int val = a_c[i];
assert(val == res);
}
}
I have tried:
> time ./reprod
WARNING: XCLBIN used is generated with Vivado version 2022.1.0 where as it is run with the Vivado version 2021.2 which is not compatible. May result to weird behaviour.
INFO: [HW-EMU 07-0] Please refer the path "/tmp/.run/1737097/hw_em/device0/binary_0/behav_waveform/xsim/simulate.log" for more detailed simulation infos, errors and warnings.
INFO: [Common 17-206] Exiting xsim at Tue Apr 12 15:07:28 2022...
SIMULATION EXITED
[2] 1737097 segmentation fault (core dumped) ./reprod
./reprod 0.59s user 0.08s system 0% cpu 5:00.12 total
Is this a problem you also see on real hardware, or only in hardware emulation?
@Ralender. Thanks. This is emulation, and I am not quite sure what assumptions emulation makes about whether the xclbin is loaded prior to buffer allocation. I suspect the problem is the loading of the xclbin. If you move the FAIL block to immediately after loading the xclbin, does it still fail?
Still, of course, if the xclbin is required to be loaded, then your FAIL allocation should truly have failed with bad_alloc. That would be the case on HW, where the current MEM_TOPOLOGY known by the driver must at a minimum include the bank you request in the BO allocation.
But even on HW, it is not clear to me what happens to an already allocated BO when the xclbin is swapped out for another one. @maxzhen? If the memory topology of the second xclbin matches that of the first, presumably it should work, but if not, what happens?
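For illustration, a minimal sketch of that expectation against the reproducer's allocation (the exact exception type is an assumption; on HW the pre-load_xclbin allocation on bank 0 would be expected to be rejected if that bank is not in the driver's current MEM_TOPOLOGY):
#include <xrt/xrt_kernel.h>
#include <xrt.h>
#include <exception>
#include <iostream>
int main() {
  auto dev = xrt::device(0u);
  try {
    // Same call as the FAIL path of the reproducer: a 16-byte BO on bank 0,
    // requested before any xclbin has been loaded.
    auto early_bo = xrt::bo(dev, 16, /*flags=*/0, /*bank=*/0);
    std::cout << "allocation before load_xclbin succeeded\n";
  } catch (const std::exception &e) {
    // Expected on HW per the comment above (e.g. std::bad_alloc); the exact type is an assumption.
    std::cout << "allocation before load_xclbin rejected: " << e.what() << "\n";
  }
}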
@Ralender, can you try the FAIL allocation after the xclbin is loaded and annotate the issue? It could be emulation related, and I will assign the issue accordingly.
> @Ralender. Thanks. This is emulation, and I am not quite sure what assumptions emulation makes about whether the xclbin is loaded prior to buffer allocation. I suspect the problem is the loading of the xclbin. If you move the FAIL block to immediately after loading the xclbin, does it still fail?
Yes, I updated the cpp and the zip.
> Still, of course, if the xclbin is required to be loaded, then your FAIL allocation should truly have failed with bad_alloc. That would be the case on HW, where the current MEM_TOPOLOGY known by the driver must at a minimum include the bank you request in the BO allocation.
The banks should match the banks expected by the xclbin; everything is at 0.
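For reference, a small sketch of how the bank the xclbin connects to the kernel's buffer argument could be checked against the hard-coded 0 (illustrative only; it assumes xrt::kernel::group_id and reuses the kernel name from the reproducer):
#include <xrt/xrt_kernel.h>
#include <xrt.h>
#include <iostream>
int main() {
  auto dev  = xrt::device(0u);
  auto uuid = dev.load_xclbin("device.xclbin"); // kernel construction requires a loaded xclbin
  auto krnl = xrt::kernel(dev, uuid, "dlerEE_clES2_E6Kernel_0i996ruy");
  // group_id(0) reports the memory bank the xclbin connects to kernel argument 0.
  std::cout << "bank for arg 0: " << krnl.group_id(0) << "\n";
}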
> @Ralender, can you try the FAIL allocation after the xclbin is loaded and annotate the issue? It could be emulation related, and I will assign the issue accordingly.
I am not sure what to annotate it as. I am compiling for hw to determine whether it is emulation related.
Both creating the bo before loading the xclbin and creating it after work on hw, so it is an emulation issue.
Hi @Ralender, could you please file a CR with the test case and steps to reproduce? I would like to assign it to a team member for verification and investigation of the issue.
I have tried to run it locally. It seems I need clang++ to follow the steps to reproduce. Can we use g++ instead of clang++ to compile the host code while verifying why it is crashing in XRT?
@venkatp-xilinx Yes, I think g++ will work too.
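For reference, the g++ invocation would presumably just mirror the clang++ one from the steps above:
> g++ -o reprod host.cpp -g -I/opt/xilinx/xrt/include -L/opt/xilinx/xrt/lib -lOpenCL -luuid -lxrt_coreutil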