[mlagsyncd]Fix mlag socket read fail

Open pettershao-ragilenetworks opened this issue 3 years ago • 13 comments

What I did
Fixed the mclagsyncd server socket read failure.

Why I did it
If the socket read fails, mclagsyncd cannot communicate with iccpd, and the whole mlag cannot work.

How I verified it
Checked the socket state.

Details if related

Issue description: in the strace log below, the read size for fd 34 is 0, which means the variable holding the buffer size is not initialized correctly. MSG_BATCH_SIZE should be a macro; otherwise it may be initialized after the size that depends on it, which leaves that size at 0.
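The initialization-order hazard described above can be reproduced in isolation. The following is a minimal sketch, not the actual mclagsyncd code: the thread does not show the real declarations, so `MAX_MSG_LEN`, `g_runtimeBatch`, `lateBatchSize`, `brokenBufSize`, and `fixedBufSize` are hypothetical names, and the values 4096 and 256 are assumptions chosen only so that their product matches the 1048576-byte read size in the normal strace log. It uses namespace-scope variables; the same kind of problem can arise with class members that are initialized in declaration order.

```cpp
#include <cstddef>

// Hypothetical values: 4096 * 256 = 1048576, the size seen in the normal read.
#define MAX_MSG_LEN 4096
#define MSG_BATCH_SIZE 256

// Stand-in for a value that is only available at run time. The volatile
// read forces lateBatchSize to be initialized dynamically, not at compile time.
static volatile int g_runtimeBatch = 256;

// Broken pattern: the batch size is a variable that is defined (and therefore
// initialized) after the buffer size that depends on it. When brokenBufSize is
// computed, lateBatchSize still holds its zero-initialized value.
extern const int lateBatchSize;
const std::size_t brokenBufSize =
    static_cast<std::size_t>(MAX_MSG_LEN) * static_cast<std::size_t>(lateBatchSize);
const int lateBatchSize = g_runtimeBatch;

// Fixed pattern: with a macro, the product is a compile-time constant,
// so the order of initialization no longer matters.
const std::size_t fixedBufSize =
    static_cast<std::size_t>(MAX_MSG_LEN) * MSG_BATCH_SIZE;
static_assert(fixedBufSize == 1048576, "buffer size is known at compile time");
```

In this sketch `brokenBufSize` ends up as 0 even though `lateBatchSize` later becomes 256, which is the shape of failure that would produce a `read()` call with a size of 0.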

root@sonic:/home/admin# strace -f -p 6681 (only showing fd 13 (server fd) and fd 34 (conn fd))
[pid  6681] setsockopt(13, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid  6681] setsockopt(13, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid  6681] bind(13, {sa_family=AF_INET, sin_port=htons(2626), sin_addr=inet_addr("127.0.0.6")}, 16) = 0
[pid  6681] listen(13, 2) 
[pid  6681] accept(13, {sa_family=AF_INET, sin_port=htons(40028), sin_addr=inet_addr("127.0.0.1")}, [32624->16]) = 34
[pid  6681] write(1, "Connected!\n", 11) = 11
[pid  6681] epoll_ctl(12, EPOLL_CTL_ADD, 34, {EPOLLIN, {u32=34, u64=34}}) = 0
[pid  6681] epoll_wait(12, [], 2, 0)    = 0
[pid  6681] epoll_wait(12, [{EPOLLIN, {u32=34, u64=34}}], 2, -1) = 1
[pid  6681] read(34, "", 0)             = 0   <<==== here the read size is 0; the log confirms that both m_bufsize and m_pos are 0.
[pid  6681] sendto(9, "<13>Jul 21 06:50:44 mclagsyncd: "..., 65, MSG_NOSIGNAL, NULL, 0) = 65
[pid  6681] futex(0x7f707811d1a0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
[pid  6681] close(13)                   = 0

root@sonic:/home/admin# netstat -ap | grep 2626
tcp        0      0 127.0.0.6:2626          0.0.0.0:*               LISTEN      4817/mclagsyncd
tcp        946    0 127.0.0.6:2626          localhost:33166         ESTABLISHED 6681/mclagsyncd     //receive queue is not empty since read failed
tcp        0      0 localhost:33166         127.0.0.6:2626          ESTABLISHED 6682/iccpd


For comparison, the strace log of a normal read:

[pid 15027] read(34, "\1\2\24\0\3\0\f\0test_sync_fd", 1048576) = 20
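The difference between the two logs comes down to the size argument passed to `read()`. The sketch below is an illustration under assumptions, not code from mclagsyncd: it uses an `AF_UNIX` socket pair instead of the TCP connection on port 2626 so that it is self-contained, and `ZeroReadDemo` and `runZeroReadDemo` are hypothetical names. It shows that a `read()` with a size of 0 returns 0 even when data is queued, and that the data stays in the receive queue until a read with a non-zero size picks it up.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstddef>

// Return values of the two read() calls; -1 means the demo could not run.
struct ZeroReadDemo {
    ssize_t zeroSized;  // read(fd, buf, 0)
    ssize_t normal;     // read(fd, buf, sizeof(buf))
};

ZeroReadDemo runZeroReadDemo() {
    ZeroReadDemo result{-1, -1};
    int fds[2];
    // A connected pair of stream sockets stands in for the iccpd connection.
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return result;
    }
    const char msg[] = "test_sync_fd";
    const std::size_t len = sizeof(msg) - 1;  // exclude the trailing '\0'
    if (write(fds[0], msg, len) == static_cast<ssize_t>(len)) {
        char buf[64];
        // Size 0: returns 0 and consumes nothing, although data is pending.
        result.zeroSized = read(fds[1], buf, 0);
        // Non-zero size: the queued bytes are still there and are returned.
        result.normal = read(fds[1], buf, sizeof(buf));
    }
    close(fds[0]);
    close(fds[1]);
    return result;
}
```

A caller that interprets a return value of 0 as the peer closing the connection would tear the link down at this point, which would be consistent with the non-empty receive queue in the netstat output. Checking that the buffer size is non-zero before calling `read()` is one possible defensive measure.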

@lguohan @qiluo-msft @Praveen-Brcm please help review this. It is an intermittent issue. Thanks!

@pettershao-ragilenetworks Thanks for the change. I have not been added as a reviewer; if someone can add me, I will provide approval from my end. Thanks, Praveen.

Praveen-Brcm, Jul 22 '21 19:07

@lguohan @qiluo-msft help forward this, thanks!

@lguohan help merge this, thanks!

@lguohan @qiluo-msft help merge this, thanks!

@lguohan @qiluo-msft help forward this, thanks!

/azp run

qiluo-msft, Oct 08 '21 09:10

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot], Oct 08 '21 09:10

@lguohan @qiluo-msft help forward this, thanks!

/azp run

qiluo-msft, Oct 20 '21 09:10

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot], Oct 20 '21 09:10

@lguohan @qiluo-msft As this was approved a long time ago, please help merge it. Thanks!

@pettershao-ragilenetworks The issue you tried to fix here is already fixed in https://github.com/Azure/sonic-swss/pull/2112 by fixing the initialization order.

msosyak, Jan 24 '22 09:01