
Potential infinite loop in unix_socket::sk_read

Open lforg37 opened this issue 3 years ago • 6 comments

It seems that unix_socket::sk_read in runtime_src/core/pcie/emulation/common_em/unix_socket.cxx does not handle the case where the socket holds less data than requested.

The (r = read(fd, buf + rlen, count - rlen)) < 0 condition is never true once the socket is closed: read() returns 0 in that case, so r is 0, no progress is made, and the loop never terminates.
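For illustration, here is a minimal sketch of a read loop that treats a 0 return from read() as end-of-file instead of retrying. The name read_exact and its return convention are hypothetical and are not taken from the XRT sources. It only shows the general POSIX pattern.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/types.h>
#include <unistd.h>   // read

// Hypothetical helper, NOT the actual XRT code: try to read exactly `count`
// bytes from `fd`. Returns the number of bytes actually read, or -1 on error.
// A return value smaller than `count` means the peer closed the connection.
ssize_t read_exact(int fd, void* buf, size_t count)
{
  assert(buf != nullptr || count == 0);
  auto* p = static_cast<char*>(buf);
  size_t rlen = 0;
  while (rlen < count) {
    ssize_t r = read(fd, p + rlen, count - rlen);
    if (r < 0) {
      if (errno == EINTR)
        continue;   // interrupted by a signal: retry
      return -1;    // real error, errno is set by read()
    }
    if (r == 0)
      break;        // EOF: peer closed, stop instead of looping forever
    rlen += static_cast<size_t>(r);
  }
  return static_cast<ssize_t>(rlen);
}
```

With this shape the caller can tell a short read from a complete one by comparing the return value with count, and decide whether to report an error.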

This behaviour has been observed on standard code. I have not found out why the socket sometimes contains less data than expected. The same program may or may not freeze from one run to the next, so there seems to be a race condition here.

XRT version: 4c83637fd4d4041a5cd4872a1391f812e54e143e
Alveo platform: xilinx_u200_gen3x16_xdma_1_202110_1

Stack trace when blocked:

#1  __GI___libc_read (fd=8, buf=0x55555559e200, nbytes=9) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007ffff7085701 in unix_socket::sk_read (this=0x555555593d60, rbuf=0x55555559e200, count=9)
    at XRT/src/runtime_src/core/pcie/emulation/common_em/unix_socket.cxx:131
#3  0x00007ffff7020f2f in xclhwemhal2::HwEmShim::xclFreeDeviceBuffer (this=0x55555558e0b0, offset=34359742464, sendtoxsim=true)
    at XRT/src/runtime_src/core/pcie/emulation/hw_em/generic_pcie_hal2/shim.cxx:1665
#4  0x00007ffff702d322 in xclhwemhal2::HwEmShim::xclFreeBO (this=0x55555558e0b0, boHandle=2)
    at XRT/src/runtime_src/core/pcie/emulation/hw_em/generic_pcie_hal2/shim.cxx:3128
#5  0x00007ffff6ff623d in operator() (__closure=0x7fffffffd800)
    at XRT/src/runtime_src/core/pcie/emulation/hw_em/generic_pcie_hal2/halapi.cxx:155
#6  0x00007ffff6ff62b4 in xdp::hw_emu::trace::profiling_wrapper<xclFreeBO(xclDeviceHandle, unsigned int)::<lambda()> >(const char *, struct {...} &&) (function=0x7ffff711e9fd "xclFreeBO", f=...)
    at XRT/src/runtime_src/core/pcie/emulation/hw_em/generic_pcie_hal2/plugin/xdp/hal_trace.h:79
#7  0x00007ffff6ff6334 in xclFreeBO (handle=0x55555558e0b0, boHandle=2)
    at XRT/src/runtime_src/core/pcie/emulation/hw_em/generic_pcie_hal2/halapi.cxx:151
#8  0x00007ffff6ff2057 in xrt_core::shim<xrt_core::device_pcie>::free_bo (this=0x555555593730, bo=2)
    at XRT/src/runtime_src/core/common/ishim.h:282
#9  0x00007ffff7d80a4e in xrt::bo_impl::~bo_impl (this=0x5555555b9b60, __in_chrg=<optimized out>)
    at XRT/src/runtime_src/core/common/api/xrt_bo.cpp:227
#10 0x00007ffff7d986ec in xrt::buffer_hbuf::~buffer_hbuf (this=0x5555555b9b60, __in_chrg=<optimized out>)
    at XRT/src/runtime_src/core/common/api/xrt_bo.cpp:448

lforg37 avatar Jan 26 '22 01:01 lforg37

Hi @akasat, please help assign this issue properly. Not sure why you removed the assignment without assigning someone else.

stsoe avatar Feb 11 '22 03:02 stsoe

I created a work-around for this in https://github.com/Xilinx/XRT/pull/6269

keryell avatar Feb 16 '22 19:02 keryell

This is tracked internally with https://jira.xilinx.com/browse/CR-1120194 and there is a non-SYCL pure XRT & HLS reproducer example in https://jira.xilinx.com/browse/XRT-937

keryell avatar Feb 18 '22 19:02 keryell

@sgundime-xilinx identified the issue and a fix is in progress. The order in which messageThread and unix_socket are created has been updated. With this fix, we no longer see any crash or segfault. We will create the PR shortly.

venkatp-xilinx avatar Apr 19 '22 08:04 venkatp-xilinx

What was the PR fixing this?

keryell avatar Dec 06 '23 22:12 keryell

The issue was resolved by introducing a monitoring flag that is updated periodically. The read/write calls are guarded by this flag before they are actually made. If a client or server gets disconnected, the thread is notified through the flag. CR-1120194 covered this issue and has been resolved as well.
PR: https://github.com/Xilinx/XRT/pull/6623

sgundime-xilinx avatar Dec 08 '23 13:12 sgundime-xilinx
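
As a rough illustration of the flag-guarded approach described in the last comment, the sketch below keeps an atomic "connected" flag that a periodic check updates, and consults it before issuing a read. Everything here (monitored_socket, poll_peer, guarded_read) is a hypothetical name chosen for the example. It is not the code from the PR, whose actual design may differ.

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical wrapper, NOT the code from the PR: a socket whose reads are
// guarded by a flag that a periodic monitor keeps up to date.
struct monitored_socket {
  int fd;
  std::atomic<bool> connected{true};

  explicit monitored_socket(int f) : fd(f) { assert(fd >= 0); }

  // Meant to be called periodically, e.g. from a monitoring thread.
  // Peeks at the socket without blocking and without consuming data.
  void poll_peer()
  {
    char c;
    ssize_t r = recv(fd, &c, 1, MSG_PEEK | MSG_DONTWAIT);
    if (r == 0)
      connected = false;   // orderly shutdown by the peer, nothing pending
    else if (r < 0 && errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
      connected = false;   // hard error on the socket
  }

  // Returns the number of bytes read, or -1 if the peer is known to be gone
  // (or read() itself failed).
  ssize_t guarded_read(void* buf, size_t count)
  {
    if (!connected)
      return -1;           // skip the system call entirely
    return read(fd, buf, count);
  }
};
```

A guard of this kind narrows the window but does not remove it: the peer can still disconnect between the check and the read, so the caller still has to handle a 0 return from read().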