
keep-alive packets trigger fevent for raw flows

Open · th1j5 opened this issue 2 years ago • 11 comments

A (minimal) working example to trigger the bug is found on this branch

  1. start sudo irmd --stdout
  2. make a local layer:
     irm ipcp create type local name local_ipcp
     irm ipcp bootstrap type local name local_ipcp layer local_layer
     irm name register oping_server layer local_layer
     irm bind prog oping name oping_server
  3. start oping --listen
  4. start oping -n oping_server -c 5

Observe that after the oping client has stopped sending pings (but is still connected and sending keep-alives, due to the sleep(50)), the server side (with the printf statements added in the branch) prints:

Type of event: 1.
message len: -11
Type of event: 1.
message len: -11
Type of event: 1.
message len: -11
Type of event: 32.
message len: -1006

~~The -1006 is #define EFLOWPEER 1006 /* Flow is down (peer timed out) */ but those three flow_read events before (-11) are EAGAIN 11 Resource temporarily unavailable~~

~~using the poor mans debugger (printf), I can say it is due to https://ouroboros.rocks/cgit/ouroboros/tree/src/lib/dev.c#n1345~~

th1j5 · Jun 13 '22, 11:06

This (the message len of -EAGAIN) was the expected behaviour on a non-blocking flow. The question remains whether an 'fqueue' event needs to be emitted for each keep-alive packet.
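For reference, the server-side read path boils down to something like the sketch below. The flow_read() prototype and the handle_payload() helper are assumptions for illustration (the real declarations live in the Ouroboros headers and in oping); the point is only that -EAGAIN on a non-blocking flow means "woken up without payload", which is exactly what a keep-alive causes here.

/* Minimal sketch, assuming a flow_read()-style call that returns the
 * number of bytes read or a negative errno value. */
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

ssize_t flow_read(int fd, void * buf, size_t count);    /* assumed prototype */
void    handle_payload(const void * buf, size_t len);   /* hypothetical      */

static void on_flow_event(int fd)
{
        char    buf[2048];
        ssize_t len;

        len = flow_read(fd, buf, sizeof(buf));
        if (len == -EAGAIN)   /* woken up without payload, e.g. by a     */
                return;       /* keep-alive, as in the log above         */
        if (len < 0)          /* real error, e.g. -EFLOWPEER:            */
                return;       /* the peer timed out and the flow is down */

        handle_payload(buf, (size_t) len);
}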

th1j5 · Jun 13 '22, 12:06

Thus, keep-alive packets trigger an fevent (with type FLOW_PKT) when they should not.

I'm not sure whether they should be filtered out in fqueue_filter (they should not take this if-statement) or elsewhere; I'm not familiar enough with the codebase to say more.

th1j5 · Jun 13 '22, 15:06

This is indeed a bug: for non-FRCT flows, the keepalives aren't filtered out by that if-statement.
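To make the intent concrete, the filtering being discussed amounts to something like the sketch below, where a keep-alive on a raw (non-FRCT) flow is swallowed instead of being raised as a FLOW_PKT event. The structs, the function name and the assumption that a keep-alive carries no payload are purely illustrative; the actual check is the if-statement in src/lib/dev.c.

#include <stddef.h>

/* Illustrative stand-ins, not the real dev.c data structures. */
struct flow_info { int has_frct; };   /* flow runs FRCT, or is a raw flow */
struct pkt_info  { size_t len; };     /* payload length of the packet     */

/* Return 1 if the packet should be reported as a FLOW_PKT event,
 * 0 if it should be filtered out. */
static int report_as_flow_pkt(const struct flow_info * f,
                              const struct pkt_info *  p)
{
        if (f->has_frct)   /* FRCT flow: keep-alives are already handled */
                return 1;  /* by the existing check, nothing to add      */

        if (p->len == 0)   /* raw flow: assuming a keep-alive carries no */
                return 0;  /* payload, swallow it instead of waking the  */
                           /* reader only to have it see -EAGAIN         */
        return 1;          /* real data: report the event                */
}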

@th1j5 can you check that this is fixed in

https://github.com/dstaesse/ouroboros/tree/be-fix-fevent-keepalive

dstaesse · Jun 13 '22, 17:06

No, the keepalives are indeed filtered out in fevent, but now the client cannot dealloc (no idea why; it's a bit late for me to figure it out tonight). See https://github.com/th1j5/ouroboros/tree/bug-fevent-keepalive for an MWE (the instructions are the same as in the first comment).

th1j5 · Jun 13 '22, 22:06

This is what I get for the client when it's blocked, so it will have something to do with a lock, I guess:

(gdb) bt
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f6b3a311098) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f6b3a311018, cond=0x7f6b3a311070) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7f6b3a311070, mutex=0x7f6b3a311018) at pthread_cond_wait.c:647
#3  0x00007f6b39ede691 in shm_rbuff_fini (rb=0x55e405143340) at /mnt/data/UGent/2021-2022/Thesis/tools/ouroboros/src/lib/shm_rbuff_pthr.c:288
#4  0x00007f6b3a10c869 in flow_dealloc (fd=64) at /mnt/data/UGent/2021-2022/Thesis/tools/ouroboros/src/lib/dev.c:997
#5  0x000055e404fa1286 in client_main () at /mnt/data/UGent/2021-2022/Thesis/tools/ouroboros/src/tools/oping/oping_client.c:275
#6  0x000055e404fa2235 in main (argc=0, argv=0x7fffca896390) at /mnt/data/UGent/2021-2022/Thesis/tools/ouroboros/src/tools/oping/oping.c:273
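For context, the shm_rbuff_fini frame corresponds to a teardown pattern along the lines of the sketch below (names and fields are illustrative, not the actual shm_rbuff_pthr.c code): dealloc refuses to complete while the ring buffer still holds unread packets.

#include <pthread.h>
#include <stddef.h>

/* Illustrative ring-buffer handle; the real one lives in shared memory
 * and accesses head/tail/lock through pointers into that region. */
struct rbuff {
        size_t          head;  /* index of the next slot to be written */
        size_t          tail;  /* index of the next slot to be read    */
        pthread_mutex_t mtx;
        pthread_cond_t  del;   /* signalled whenever the buffer drains */
};

static void rbuff_fini(struct rbuff * rb)
{
        pthread_mutex_lock(&rb->mtx);
        /* Block until every queued packet has been consumed; if nothing
         * ever reads the remaining packet(s), this waits forever. */
        while (rb->head != rb->tail)
                pthread_cond_wait(&rb->del, &rb->mtx);
        pthread_mutex_unlock(&rb->mtx);
}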

th1j5 · Jun 14 '22, 20:06

No, there are probably unexpected packets left in the buffer (it only exits when all remaining queued packets are cleaned up). I staged packets after the keepalives from fevent, using the mechanism we have for partial reads. That staging area will probably still contain a packet reference on exit. I will look into it tonight.
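A rough picture of that staging mechanism, with purely hypothetical names: a packet that was filtered out or only partially consumed is parked on the flow, and if nothing ever picks it up again it still counts as an outstanding packet at dealloc time.

#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-flow staging slot for a packet that was not (fully)
 * consumed by the last read. */
struct stage {
        void * pkt;   /* parked packet, or NULL if nothing is staged */
        size_t off;   /* offset of the next unread byte within pkt   */
};

/* On dealloc any staged packet has to be released as well; otherwise the
 * flow still owns a buffer that nobody will ever read, and teardown can
 * block on it (or leak the block it stands for). */
static void stage_flush(struct stage * s)
{
        free(s->pkt);  /* stand-in for returning the packet buffer */
        s->pkt = NULL;
        s->off = 0;
}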

dstaesse · Jun 15 '22, 07:06

@th1j5, this doesn't seem to reproduce for me...

I compiled your branch and had no issues with the client-side dealloc... I also added a 50-second sleep, which I didn't find anymore in the MWE, and the client dealloc still works:

[dstaesse@heteropoda build]$ oping -n oping -c 5
Pinging oping with 64 bytes of data (5 packets):

64 bytes from oping: seq=0 time=0.611 ms
64 bytes from oping: seq=1 time=0.441 ms
64 bytes from oping: seq=2 time=0.448 ms
64 bytes from oping: seq=3 time=0.450 ms
64 bytes from oping: seq=4 time=0.424 ms

--- oping ping statistics ---
5 packets transmitted, 5 received, 0 out-of-order, 0% packet loss, time: 5001.413 ms
rtt min/avg/max/mdev = 0.424/0.475/0.611/0.077 ms
Sleeping 50 seconds.
deallocing client
deallocing client done

Server also deallocated just fine.

Does it always reproduce for you?

dstaesse · Jun 15 '22, 17:06

No, it seems it's not reliably reproducible. I just tested it and got these different results:

  1. I reproduced the problem
  2. It deallocated just fine
  3. the oping client stopped midway, after 3 pings (this was with commit https://github.com/th1j5/ouroboros/commit/726e6b5a34d5dffeb5e5a728bc5862a16ac84055)

th1j5 · Jun 15 '22, 17:06

Which glibc version are you using?

[dstaesse@heteropoda build]$ ldd --version
ldd (GNU libc) 2.35

Reason I'm asking is that I'm aware of this issue in glibc: https://sourceware.org/bugzilla/show_bug.cgi?id=25847

The issue only happens with the fevent fix, and not on be?

dstaesse · Jun 15 '22, 18:06

$ ldd --version
ldd (Ubuntu GLIBC 2.31-0ubuntu9.9) 2.31

And yes, it only happened with the fevent fix; I never encountered it on be.

th1j5 · Jun 15 '22, 18:06

Got it to reproduce after some trying:

[dstaesse@heteropoda ouroboros]$ oping -n oping -c 5 -D
Pinging oping with 64 bytes of data (5 packets):

[1655319165.174923] 64 bytes from oping: seq=0 time=0.542 ms
[1655319166.174984] 64 bytes from oping: seq=1 time=0.450 ms
[1655319167.175143] 64 bytes from oping: seq=2 time=0.485 ms
[1655319168.175318] 64 bytes from oping: seq=3 time=0.496 ms
[1655319169.175472] 64 bytes from oping: seq=4 time=0.482 ms

--- oping ping statistics ---
5 packets transmitted, 5 received, 0 out-of-order, 0% packet loss, time: 5001.366 ms
rtt min/avg/max/mdev = 0.450/0.491/0.542/0.033 ms
deallocing client
deallocing client done
[dstaesse@heteropoda ouroboros]$ oping -n oping -c 5 -D
Pinging oping with 64 bytes of data (5 packets):

[1655319171.497064] 64 bytes from oping: seq=0 time=754.671 ms
[1655319172.496629] 64 bytes from oping: seq=1 time=754.077 ms
[1655319173.494265] 64 bytes from oping: seq=2 time=751.549 ms
[1655319174.497212] 64 bytes from oping: seq=3 time=754.333 ms
[1655319175.494873] 64 bytes from oping: seq=4 time=751.835 ms

--- oping ping statistics ---
5 packets transmitted, 5 received, 0 out-of-order, 0% packet loss, time: 5752.795 ms
rtt min/avg/max/mdev = 751.549/753.293/754.671/1.480 ms
deallocing client
deallocing client done
[dstaesse@heteropoda ouroboros]$ oping -n oping -c 5 -D
Pinging oping with 64 bytes of data (5 packets):

[1655319179.183382] 64 bytes from oping: seq=0 time=0.619 ms
[1655319180.935104] 64 bytes from oping: seq=1 time=752.148 ms
[1655319181.937768] 64 bytes from oping: seq=2 time=754.656 ms
[1655319182.935610] 64 bytes from oping: seq=3 time=752.360 ms
[1655319183.938488] 64 bytes from oping: seq=4 time=755.098 ms

--- oping ping statistics ---
5 packets transmitted, 5 received, 0 out-of-order, 0% packet loss, time: 5756.161 ms
rtt min/avg/max/mdev = 0.619/602.976/755.098/336.731 ms
deallocing client

It hangs here:

Thread 3 (Thread 0x7fd8b930e640 (LWP 8021) "oping"):
#0  0x00007fd8bae89096 in ?? () from /usr/lib/libc.so.6
#1  0x00007fd8bae9298f in pthread_rwlock_wrlock () from /usr/lib/libc.so.6
#2  0x00007fd8bb647c77 in flow_send_keepalive (flow=0x7fd8b9d19a10, now=...) at /tmp/bug/ouroboros/src/lib/dev.c:279
#3  0x00007fd8bb647e70 in _flow_keepalive (flow=0x7fd8b9d19a10) at /tmp/bug/ouroboros/src/lib/dev.c:322
#4  0x00007fd8bb647ee7 in handle_keepalives () at /tmp/bug/ouroboros/src/lib/dev.c:338
#5  0x00007fd8bb647ff4 in flow_rx (o=0x0) at /tmp/bug/ouroboros/src/lib/dev.c:364
#6  0x00007fd8bae8c54d in ?? () from /usr/lib/libc.so.6
#7  0x00007fd8baf11874 in clone () from /usr/lib/libc.so.6

Thread 2 (Thread 0x7fd8b9b0f640 (LWP 8020) "oping"):
#0  0x00007fd8baf01bbc in read () from /usr/lib/libc.so.6
#1  0x00007fd8bb552d12 in ?? () from /usr/lib/libfuse.so.2
#2  0x00007fd8bb5548ae in ?? () from /usr/lib/libfuse.so.2
#3  0x00007fd8bb552fb9 in fuse_session_loop () from /usr/lib/libfuse.so.2
#4  0x00007fd8bb62e38f in fuse_thr (o=0x55555aaf6d60) at /tmp/bug/ouroboros/src/lib/rib.c:268
#5  0x00007fd8bae8c54d in ?? () from /usr/lib/libc.so.6
#6  0x00007fd8baf11874 in clone () from /usr/lib/libc.so.6

Thread 1 (Thread 0x7fd8bb50a740 (LWP 8018) "oping"):
#0  0x00007fd8bae89119 in ?? () from /usr/lib/libc.so.6
#1  0x00007fd8bae8b920 in pthread_cond_wait () from /usr/lib/libc.so.6
#2  0x00007fd8bb631cfa in shm_rbuff_fini (rb=0x55555ab176d0) at /tmp/bug/ouroboros/src/lib/shm_rbuff_pthr.c:288
#3  0x00007fd8bb649f8a in flow_dealloc (fd=64) at /tmp/bug/ouroboros/src/lib/dev.c:997
#4  0x00005555597fd069 in client_main () at /tmp/bug/ouroboros/src/tools/oping/oping_client.c:279
#5  0x00005555597fe0a3 in main (argc=0, argv=0x7ffc9cdaced8) at /tmp/bug/ouroboros/src/tools/oping/oping.c:273

Thr 3 is waiting for the write lock on ai.lock:

(gdb) thr 3
[Switching to thread 3 (Thread 0x7fd8b930e640 (LWP 8021))]
#0  0x00007fd8bae89096 in ?? () from /usr/lib/libc.so.6
(gdb) fr 2
#2  0x00007fd8bb647c77 in flow_send_keepalive (flow=0x7fd8b9d19a10, now=...) at /tmp/bug/ouroboros/src/lib/dev.c:279
279             pthread_rwlock_wrlock(&ai.lock);
(gdb) print ai.lock
$1 = {__data = {__readers = 10, __writers = 0, __wrphase_futex = 2, __writers_futex = 1, __pad3 = 0, __pad4 = 0, __cur_writer = 0, __shared = 0, __rwelision = 0 '\000', 
    __pad1 = "\000\000\000\000\000\000", __pad2 = 0, __flags = 0}, __size = "\n\000\000\000\000\000\000\000\002\000\000\000\001", '\000' <repeats 42 times>, __align = 10}
(gdb) 

Thr 1 holds a read lock on it, taken on l 975, which keeps the writer out.

The locks behave as expected, so it must be that the rbuff still has packets queued:

(gdb) thr 1
[Switching to thread 1 (Thread 0x7fd8bb50a740 (LWP 8018))]
#0  0x00007fd8bae89119 in ?? () from /usr/lib/libc.so.6
(gdb) fr 2
#2  0x00007fd8bb631cfa in shm_rbuff_fini (rb=0x55555ab176d0) at /tmp/bug/ouroboros/src/lib/shm_rbuff_pthr.c:288
288                     if (pthread_cond_wait(rb->del, rb->lock) == EOWNERDEAD)
(gdb) print *rb->head
$2 = 11
(gdb) print *rb->tail
$3 = 10
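The head/tail values translate directly into the number of packets still queued. A minimal sketch of that arithmetic, assuming a power-of-two ring whose indices wrap modulo its size (the size below is made up):

#include <stddef.h>
#include <stdio.h>

#define RB_SIZE 1024   /* illustrative; the real ring size is set elsewhere */

/* Number of packets still queued in the ring. */
static size_t rb_used(size_t head, size_t tail)
{
        return (head - tail) & (RB_SIZE - 1);
}

int main(void)
{
        /* With *rb->head == 11 and *rb->tail == 10 exactly one packet is
         * left unread, which is why shm_rbuff_fini() keeps waiting. */
        printf("packets queued: %zu\n", rb_used(11, 10));
        return 0;
}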

This will take a bit more work, then :)

dstaesse · Jun 15 '22, 19:06