
UDP multicast: segmentation fault on read()


Hi Onload Team,

I am using onload version 7.1.0.265 in my production code and getting a segmentation fault in the Linux socket library's read() function.

The fault occurs very rarely; I cannot give an exact frequency, but it is roughly 1 in 10 million UDP messages. I suspect a race condition while reading UDP messages.

I am sharing the call stack of the application at the time of the crash. It is almost the same every time, although I know a SIGSEGV can occur anywhere.

1) /lib64/libpthread.so.0(+0x12c20) [0x7fe2e1104c20]
2) /lib64/libonload.so(+0xe52c6) [0x7fe2e13f72c6]
3) /lib64/libonload.so(+0xe5f31) [0x7fe2e13f7f31]
4) /lib64/libonload.so(+0x41fb3) [0x7fe2e1353fb3]
5) /lib64/libonload.so(read+0x139) [0x7fe2e1325e59]

The corresponding functions are:

5) read+0x139 -> I could not resolve this offset, so I do not know the exact location of that call in the stack.
4) citp_udp_recv /home/onload_tests/release-package/onload-7.1.3.202/build/gnu_x86_64/lib/transport/unix/../../../../../src/lib/transport/unix/udp_fd.c:402
3) ci_udp_recvmsg /home/onload_tests/release-package/onload-7.1.3.202/build/gnu_x86_64/lib/transport/ip/../../../../../src/lib/transport/ip/udp_recv.c:1021
2) ci_netif_has_event /home/onload_tests/release-package/onload-7.1.3.202/build/gnu_x86_64/lib/transport/ip/../../../../../src/include/ci/internal/ip.h:2644

Our read() call in the application is shown in the attached screenshot; a generic sketch of the same pattern follows below.
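For reference, this is a minimal sketch of the receive pattern involved: a blocking read() on a multicast UDP socket. The group address, port, and buffer size are placeholders rather than our production values; the exact call is in the screenshot above.

```c
/* Minimal illustration of a blocking read() on a multicast UDP socket.
 * Group/port are hypothetical placeholders, not the production values. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    const char *group = "239.1.1.1";   /* placeholder multicast group */
    const int port = 12345;            /* placeholder port */

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr(group);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0) {
        perror("IP_ADD_MEMBERSHIP");
        return 1;
    }

    char buf[2048];
    for (;;) {
        /* Blocking read(); when run under onload this call is intercepted by
         * libonload.so and reaches citp_udp_recv()/ci_udp_recvmsg(), as in
         * the stack trace above. */
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n < 0) { perror("read"); break; }
        /* ... process n bytes ... */
    }

    close(fd);
    return 0;
}
```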

After hitting this error I upgraded onload to version 7.1.3.202 on a test setup and installed it with the debug option: ./onload_install --debug

A segmentation fault happened at Tue Mar 22 20:56:21 2022, and I am sharing the dmesg -T output below.

[Tue Mar 22 20:56:20 2022] oo:[2445269]: FAIL at ../../../../../src/lib/transport/ip/tcp_sleep.c:243
[Tue Mar 22 20:56:20 2022] oo:[2445269]: ci_assert(ci_sock_is_locked(ni, w)) from ../../../../../src/lib/transport/ip/tcp_sleep.c:243
[Tue Mar 22 20:56:20 2022] oo:OscTrader[2445269]: hostname=pid=2445269
[Tue Mar 22 20:56:22 2022] [sfc efrm] efrm_pt_flush: [rs:256,00000000aa820e0f] EVQ=2048 TXQ=512 RXQ=512
[Tue Mar 22 20:56:22 2022] [sfc efrm] __efrm_vi_resource_issue_flush: rx queue 256 flush requested for nic 0
[Tue Mar 22 20:56:22 2022] [sfc efrm] Flushed queue nic 0 type 1 0x100 rc 0
[Tue Mar 22 20:56:22 2022] [sfc efrm] __efrm_vi_resource_issue_flush: tx queue 256 flush requested for nic 0
[Tue Mar 22 20:56:22 2022] [sfc efrm] efrm_handle_dmaq_flushed: nic_i=0 instance=256 rx_flush=1 failed=0
[Tue Mar 22 20:56:22 2022] [sfc efrm] efrm_handle_dmaq_flushed: nic_i=0 instance=256 rx_flush=0 failed=0
[Tue Mar 22 20:56:22 2022] [sfc efrm] efrm_vi_rm_delayed_free: 00000000901a0c5d
[Tue Mar 22 20:56:22 2022] [sfc efrm] efrm_vi_rm_delayed_free: flushed VI instance=256
[Tue Mar 22 20:56:22 2022] [sfc efrm] Flushed queue nic 0 type 0 0x100 rc 0
[Tue Mar 22 20:56:22 2022] [sfc efrm] efrm_pt_flush: [rs:256,000000008cb28171] EVQ=2048 TXQ=512 RXQ=512
[Tue Mar 22 20:56:22 2022] [sfc efrm] __efrm_vi_resource_issue_flush: rx queue 256 flush requested for nic 1
[Tue Mar 22 20:56:22 2022] [sfc efrm] efrm_handle_dmaq_flushed: nic_i=1 instance=256 rx_flush=1 failed=0
[Tue Mar 22 20:56:22 2022] [sfc efrm] Flushed queue nic 1 type 1 0x100 rc 0
[Tue Mar 22 20:56:22 2022] [sfc efrm] __efrm_vi_resource_issue_flush: tx queue 256 flush requested for nic 1
[Tue Mar 22 20:56:22 2022] [sfc efrm] efrm_handle_dmaq_flushed: nic_i=1 instance=256 rx_flush=0 failed=0
[Tue Mar 22 20:56:22 2022] [sfc efrm] efrm_vi_rm_delayed_free: 00000000901a0c5d
[Tue Mar 22 20:56:22 2022] [sfc efrm] efrm_vi_rm_delayed_free: flushed VI instance=256
[Tue Mar 22 20:56:22 2022] [sfc efrm] Flushed queue nic 1 type 0 0x100 rc 0
[Tue Mar 22 20:56:22 2022] [sfc efrm] efrm_vi_rm_free_flushed_resource: [rs:256,00000000aa820e0f]
[Tue Mar 22 20:56:22 2022] [sfc efrm] __efrm_vi_resource_free: Freeing 256
[Tue Mar 22 20:56:22 2022] [sfc efrm] Flushed queue nic 0 type 2 0x100 rc 0
[Tue Mar 22 20:56:22 2022] [sfc efrm] efrm_vi_rm_free_flushed_resource: [rs:256,000000008cb28171]
[Tue Mar 22 20:56:22 2022] [sfc efrm] __efrm_vi_resource_free: Freeing 256
[Tue Mar 22 20:56:22 2022] [sfc efrm] Flushed queue nic 1 type 2 0x100 rc 0

I found a related issue on the Xilinx support website, but neither of the versions I am using is included in its affected release list: https://support.xilinx.com/s/article/75067?language=en_US

Do you have any idea why this is happening, or is there an existing fix for the problem? If you need any extra debug info or a test case, you can reach me by email at ahmetsaidtekkurt at gmail.com

Thank you in advance, A.Sait Tekkurt
