netopeer2 test_sub_ntf_advanced fails with Child aborted

The test suite started to fail when building locally:

$ ./tests/test_sub_ntf_advanced
[==========] Running 18 test(s).
[ RUN      ] test_filter_pass
[       OK ] test_filter_pass
[ RUN      ] test_filter_no_pass
[       OK ] test_filter_no_pass
[ RUN      ] test_modifysub_filter
[       OK ] test_modifysub_filter
[ RUN      ] test_modifysub_stop_time
[       OK ] test_modifysub_stop_time
[ RUN      ] test_modifysub_fail_no_such_sub
[       OK ] test_modifysub_fail_no_such_sub
[ RUN      ] test_deletesub
[       OK ] test_deletesub
[ RUN      ] test_deletesub_fail
[       OK ] test_deletesub_fail
[ RUN      ] test_deletesub_fail_diff_sess
[       OK ] test_deletesub_fail_diff_sess
[ RUN      ] test_ds_subscriptions
[       OK ] test_ds_subscriptions
[ RUN      ] test_ds_subscriptions_sent_event
[       OK ] test_ds_subscriptions_sent_event
[ RUN      ] test_ds_subscriptions_excluded_event
[       OK ] test_ds_subscriptions_excluded_event
[ RUN      ] test_multiple_subscriptions
[       OK ] test_multiple_subscriptions
[ RUN      ] test_multiple_subscriptions_notif
0x1 != 0x6
[   LINE   ] --- ./tests/test_sub_ntf_advanced.c:650: error: Failure!Aborted (core dumped)

Coredump (but it doesn't have any extra details)

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
        set = {__val = {0, 0, 8297038250658185793, 2322297360617776384, 0 <repeats 12 times>}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x00007fd95642c537 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x7fd955d732a0, sa_sigaction = 0x7fd955d732a0}, sa_mask = {__val = {18, 94192816795984, 255, 1, 94192802424816, 705, 77, 94192816651200, 255, 140728008513688, 94192816652912, 140728008513688, 140571431947340, 0, 94192815835544, 140728008513664}}, sa_flags = -889903416, sa_restorer = 0x55aaf9540998}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007fd9567df44e in exit_test () from /usr/local/lib/libcmocka.so.0
No symbol table info available.
#3  0x00007fd9567e2d26 in _fail () from /usr/local/lib/libcmocka.so.0
No symbol table info available.
#4  0x00007fd9567e207b in _assert_int_equal () from /usr/local/lib/libcmocka.so.0
No symbol table info available.
#5  0x000055aaf8874509 in test_multiple_subscriptions_notif_interlaced (state=<optimized out>) at ./tests/test_sub_ntf_advanced.c:705
        st = 0x55aaf9540730
        data = <optimized out>
#6  0x00007fd9567e40bf in cmocka_run_one_test_or_fixture () from /usr/local/lib/libcmocka.so.0
No symbol table info available.
#7  0x00007fd9567e438d in cmocka_run_one_tests () from /usr/local/lib/libcmocka.so.0
No symbol table info available.
#8  0x00007fd9567e4921 in _cmocka_run_group_tests () from /usr/local/lib/libcmocka.so.0
No symbol table info available.
#9  0x000055aaf886b764 in main (argc=1, argv=0x7ffdcaf526c8) at ./tests/test_sub_ntf_advanced.c:1018
        tests = {{name = 0x55aaf8876544 "test_filter_pass", test_func = 0x55aaf886e120 <test_filter_pass>, setup_func = 0x0, teardown_func = 0x55aaf886c570 <teardown_common>, initial_state = 0x0}, {name = 0x55aaf8876555 "test_filter_no_pass", test_func = 0x55aaf886d9e0 <test_filter_no_pass>, setup_func = 0x0, teardown_func = 0x55aaf886c570 <teardown_common>, initial_state = 0x0}, {name = 0x55aaf8876569 "test_modifysub_filter", test_func = 0x55aaf8871c30 <test_modifysub_filter>, setup_func = 0x0, teardown_func = 0x55aaf886c570 <teardown_common>, initial_state = 0x0}, {name = 0x55aaf887657f "test_modifysub_stop_time", test_func = 0x55aaf88725c0 <test_modifysub_stop_time>, setup_func = 0x0, teardown_func = 0x55aaf886c570 <teardown_common>, initial_state = 0x0}, {name = 0x55aaf8876700 "test_modifysub_fail_no_such_sub", test_func = 0x55aaf886d5c0 <test_modifysub_fail_no_such_sub>, setup_func = 0x0, teardown_func = 0x0, initial_state = 0x0}, {name = 0x55aaf8876598 "test_deletesub", test_func = 0x55aaf886f250 <test_deletesub>, setup_func = 0x0, teardown_func = 0x55aaf886c570 <teardown_common>, initial_state = 0x0}, {name = 0x55aaf88765a7 "test_deletesub_fail", test_func = 0x55aaf886d1a0 <test_deletesub_fail>, setup_func = 0x0, teardown_func = 0x0, initial_state = 0x0}, {name = 0x55aaf88765bb "test_deletesub_fail_diff_sess", test_func = 0x55aaf88712f0 <test_deletesub_fail_diff_sess>, setup_func = 0x0, teardown_func = 0x55aaf886c570 <teardown_common>, initial_state = 0x0}, {name = 0x55aaf88765d9 "test_ds_subscriptions", test_func = 0x55aaf886e5c0 <test_ds_subscriptions>, setup_func = 0x0, teardown_func = 0x55aaf886c570 <teardown_common>, initial_state = 0x0}, {name = 0x55aaf8876720 "test_ds_subscriptions_sent_event", test_func = 0x55aaf886eb30 <test_ds_subscriptions_sent_event>, setup_func = 0x0, teardown_func = 0x55aaf886c570 <teardown_common>, initial_state = 0x0}, {name = 0x55aaf8876748 "test_ds_subscriptions_excluded_event", test_func = 0x55aaf88709f0 <test_ds_subscriptions_excluded_event>, setup_func = 0x0, teardown_func = 0x55aaf886c570 <teardown_common>, initial_state = 0x0}, {name = 0x55aaf88765ef "test_multiple_subscriptions", test_func = 0x55aaf88701d0 <test_multiple_subscriptions>, setup_func = 0x0, teardown_func = 0x55aaf886c570 <teardown_common>, initial_state = 0x0}, {name = 0x55aaf8876770 "test_multiple_subscriptions_notif", test_func = 0x55aaf886f9f0 <test_multiple_subscriptions_notif>, setup_func = 0x0, teardown_func = 0x55aaf886c570 <teardown_common>, initial_state = 0x0}, {name = 0x55aaf8876798 "test_multiple_subscriptions_notif_interlaced", test_func = 0x55aaf8874440 <test_multiple_subscriptions_notif_interlaced>, setup_func = 0x55aaf886cc50 <setup_notif2_data>, teardown_func = 0x55aaf886c570 <teardown_common>, initial_state = 0x0}, {name = 0x55aaf887660b "test_killsub_fail_nacm", test_func = 0x55aaf886de00 <test_killsub_fail_nacm>, setup_func = 0x0, teardown_func = 0x55aaf886cb10 <teardown_nacm>, initial_state = 0x0}, {name = 0x55aaf8876622 "test_killsub_fail_no_such_sub", test_func = 0x55aaf886cd80 <test_killsub_fail_no_such_sub>, setup_func = 0x55aaf886c9e0 <setup_test_killsub>, teardown_func = 0x55aaf886cb10 <teardown_nacm>, initial_state = 0x0}, {name = 0x55aaf8876640 "test_killsub_same_sess", test_func = 0x55aaf8872f60 <test_killsub_same_sess>, setup_func = 0x55aaf886c9e0 <setup_test_killsub>, teardown_func = 0x55aaf886cb10 <teardown_nacm>, initial_state = 0x0}, {name = 0x55aaf8876657 "test_killsub_diff_sess", test_func = 0x55aaf88739c0 <test_killsub_diff_sess>, setup_func = 0x55aaf886c9e0 <setup_test_killsub>, teardown_func = 0x55aaf886cb10 <teardown_nacm>, initial_state = 0x0}}

Nov 25 '21 15:11 oerdnj

Not sure how you got to this point but when executing the test binaries directly, they do not have all their ENV variables set properly, which may cause problems. Try make test_clean and running the test again. If it still fails in similar fashion, please mention what distribution you are on (and I assume you are using all our projects from current master branches) and I will try to reproduce it or at least improve the error output so that the actual problem is printed.

Nov 25 '21 15:11 michalvasko

That's from a clean build - I just went and run the test by hand.

Distribution is Debian bullseye, but libyang2, libnetconf2 and sysrepo libraries are the recent packaged versions compiled also locally.

Nov 25 '21 15:11 oerdnj

This is the package list I have installed locally:

ii  libnetconf2-2:amd64      2.0.24-3     amd64        NETCONF protocol library [C library]
ii  libnetconf2-dev:amd64    2.0.24-3     amd64        NETCONF protocol library [C development]
ii  libsysrepo-dev:amd64     2.0.53-5     amd64        YANG-based configuration and operational state data store [C development]
ii  libsysrepo6:amd64        2.0.53-5     amd64        YANG-based configuration and operational state data store [C library]
ii  libsysrepo6-dbgsym:amd64 2.0.53-5     amd64        debug symbols for libsysrepo6
ii  libyang2:amd64           2.0.112-6    amd64        parser toolkit for IETF YANG data modeling - runtime
ii  libyang2-dev:amd64       2.0.112-6    amd64        parser toolkit for IETF YANG data modeling - development files
ii  libyang2-tools           2.0.112-6    amd64        parser toolkit for IETF YANG data modeling - executable tools

Nov 25 '21 20:11 oerdnj

Try it now, I have increased the timeout, for some reason I have missed that 0x1 return is just an elapsed timeout. Before that, I have actually installed a new VM of Debian 11 and all the tests passed.

Nov 26 '21 10:11 michalvasko

It still times out (randomly):

[ RUN      ] test_multiple_subscriptions_notif
<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="42">
  <rpc-error>
    <error-type>application</error-type>
    <error-tag>operation-failed</error-tag>
    <error-severity>error</error-severity>
    <error-message xml:lang="en">Callback event "rpc" with ID 12 processing timed out.</error-message>
  </rpc-error>
  <rpc-error>
    <error-type>application</error-type>
    <error-tag>operation-failed</error-tag>
    <error-severity>error</error-severity>
    <error-message xml:lang="en">User callback failed.</error-message>
  </rpc-error>
</rpc-reply>
[   LINE   ] --- ./tests/test_sub_ntf_advanced.c:650: error: Failure!

17/17 Test #16: test_confirmed_commit ............   Passed   10.49 sec

94% tests passed, 1 tests failed out of 17

Total Test time (real) =  10.51 sec

The following tests FAILED:
	 12 - test_sub_ntf_advanced (Child aborted)

or

[ RUN      ] test_multiple_subscriptions_notif_interlaced
<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="42">
  <rpc-error>
    <error-type>application</error-type>
    <error-tag>operation-failed</error-tag>
    <error-severity>error</error-severity>
    <error-message xml:lang="en">Callback event "rpc" with ID 15 processing timed out.</error-message>
  </rpc-error>
  <rpc-error>
    <error-type>application</error-type>
    <error-tag>operation-failed</error-tag>
    <error-severity>error</error-severity>
    <error-message xml:lang="en">User callback failed.</error-message>
  </rpc-error>
</rpc-reply>
[   LINE   ] --- ./tests/test_sub_ntf_advanced.c:705: error: Failure!

The server is not that much loaded: top - 11:53:32 up 20 days, 17:55, 6 users, load average: 16.32, 10.47, 11.22 :)

Nevertheless, even bumping the timeout to 30 seconds doesn't do the trick it seems:

16/17 Test #12: test_sub_ntf_advanced ............Child aborted***Exception:   7.94 sec
[==========] Running 18 test(s).
[ RUN      ] test_filter_pass
[       OK ] test_filter_pass
[ RUN      ] test_filter_no_pass
[       OK ] test_filter_no_pass
[ RUN      ] test_modifysub_filter
[       OK ] test_modifysub_filter
[ RUN      ] test_modifysub_stop_time
[       OK ] test_modifysub_stop_time
[ RUN      ] test_modifysub_fail_no_such_sub
[       OK ] test_modifysub_fail_no_such_sub
[ RUN      ] test_deletesub
[       OK ] test_deletesub
[ RUN      ] test_deletesub_fail
[       OK ] test_deletesub_fail
[ RUN      ] test_deletesub_fail_diff_sess
[       OK ] test_deletesub_fail_diff_sess
[ RUN      ] test_ds_subscriptions
[       OK ] test_ds_subscriptions
[ RUN      ] test_ds_subscriptions_sent_event
[       OK ] test_ds_subscriptions_sent_event
[ RUN      ] test_ds_subscriptions_excluded_event
[       OK ] test_ds_subscriptions_excluded_event
[ RUN      ] test_multiple_subscriptions
[       OK ] test_multiple_subscriptions
[ RUN      ] test_multiple_subscriptions_notif
[       OK ] test_multiple_subscriptions_notif
[ RUN      ] test_multiple_subscriptions_notif_interlaced
<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="42">
  <rpc-error>
    <error-type>application</error-type>
    <error-tag>operation-failed</error-tag>
    <error-severity>error</error-severity>
    <error-message xml:lang="en">Callback event "rpc" with ID 15 processing timed out.</error-message>
  </rpc-error>
  <rpc-error>
    <error-type>application</error-type>
    <error-tag>operation-failed</error-tag>
    <error-severity>error</error-severity>
    <error-message xml:lang="en">User callback failed.</error-message>
  </rpc-error>
</rpc-reply>
[   LINE   ] --- ./tests/test_sub_ntf_advanced.c:705: error: Failure!

17/17 Test #16: test_confirmed_commit ............   Passed   10.34 sec

And it certainly doesn't wait for 30 seconds in the poll().

Nov 26 '21 11:11 oerdnj

Okay, this is a different problem and another timeout elapses (you are on some really slow platform, are you not?). Try to add parameter -t 10 here, that should help with this. If it does, I will commit it.

Nov 26 '21 11:11 michalvasko

you are on some really slow platform, are you not? Umm, not really: AMD Ryzen 9 5900X 12-Core Processor (24 with HT)

The server is just busy compiling stuff and the tests are run all at once in parallel.

The change looks positive:

 1/17 Test  #1: headers ..........................   Passed    0.29 sec
 2/17 Test  #2: test_rpc .........................   Passed    4.76 sec
 3/17 Test #10: test_nacm ........................   Passed    4.87 sec
 4/17 Test  #4: test_filter ......................   Passed    4.98 sec
 5/17 Test #17: test_url .........................   Passed    4.96 sec
 6/17 Test  #3: test_edit ........................   Passed    5.24 sec
 7/17 Test  #9: test_with_defaults ...............   Passed    5.28 sec
 8/17 Test  #7: test_parallel_sessions ...........   Passed    5.42 sec
 9/17 Test  #8: test_candidate ...................   Passed    5.44 sec
10/17 Test #11: test_sub_ntf .....................   Passed    8.19 sec
11/17 Test  #6: test_subscribe_param .............   Passed    8.28 sec
12/17 Test #15: test_yang_push_advanced ..........   Passed    8.25 sec
13/17 Test  #5: test_subscribe_filter ............   Passed    8.87 sec
14/17 Test #13: test_sub_ntf_filter ..............   Passed    8.95 sec
15/17 Test #14: test_yang_push ...................   Passed    9.36 sec
16/17 Test #12: test_sub_ntf_advanced ............   Passed   12.09 sec
17/17 Test #16: test_confirmed_commit ............   Passed   13.12 sec

I'll keep the build running in the loop over lunch and get back in ~30 minutes, but so far no more timeouts.

Nov 26 '21 11:11 oerdnj

Nope, it failed again, but it is much more reliable with the aforementioned changes.

 1/17 Test  #1: headers ..........................   Passed    0.02 sec
 2/17 Test  #2: test_rpc .........................   Passed    2.12 sec
 3/17 Test  #9: test_with_defaults ...............   Passed    2.15 sec
 4/17 Test #10: test_nacm ........................   Passed    2.23 sec
 5/17 Test #17: test_url .........................   Passed    2.27 sec
 6/17 Test  #7: test_parallel_sessions ...........   Passed    2.35 sec
 7/17 Test  #8: test_candidate ...................   Passed    2.42 sec
 8/17 Test  #3: test_edit ........................   Passed    2.60 sec
 9/17 Test  #4: test_filter ......................   Passed    2.61 sec
10/17 Test  #6: test_subscribe_param .............   Passed    3.72 sec
11/17 Test #11: test_sub_ntf .....................   Passed    3.99 sec
12/17 Test  #5: test_subscribe_filter ............   Passed    4.42 sec
13/17 Test #13: test_sub_ntf_filter ..............   Passed    4.60 sec
14/17 Test #15: test_yang_push_advanced ..........   Passed    4.84 sec
15/17 Test #14: test_yang_push ...................   Passed    4.93 sec
16/17 Test #12: test_sub_ntf_advanced ............Child aborted***Exception:   7.93 sec
[==========] Running 18 test(s).
[ RUN      ] test_filter_pass
[       OK ] test_filter_pass
[ RUN      ] test_filter_no_pass
[       OK ] test_filter_no_pass
[ RUN      ] test_modifysub_filter
[       OK ] test_modifysub_filter
[ RUN      ] test_modifysub_stop_time
[       OK ] test_modifysub_stop_time
[ RUN      ] test_modifysub_fail_no_such_sub
[       OK ] test_modifysub_fail_no_such_sub
[ RUN      ] test_deletesub
[       OK ] test_deletesub
[ RUN      ] test_deletesub_fail
[       OK ] test_deletesub_fail
[ RUN      ] test_deletesub_fail_diff_sess
[       OK ] test_deletesub_fail_diff_sess
[ RUN      ] test_ds_subscriptions
[       OK ] test_ds_subscriptions
[ RUN      ] test_ds_subscriptions_sent_event
[       OK ] test_ds_subscriptions_sent_event
[ RUN      ] test_ds_subscriptions_excluded_event
[       OK ] test_ds_subscriptions_excluded_event
[ RUN      ] test_multiple_subscriptions
[       OK ] test_multiple_subscriptions
[ RUN      ] test_multiple_subscriptions_notif
[       OK ] test_multiple_subscriptions_notif
[ RUN      ] test_multiple_subscriptions_notif_interlaced
0x1 != 0x6
[   LINE   ] --- ./tests/test_sub_ntf_advanced.c:705: error: Failure!

17/17 Test #16: test_confirmed_commit ............   Passed   10.60 sec

Running the tests one at the time also didn't help to make it pass all the time, so I think I'll just disable this particular test in the package. The unit tests are more useful for the CI than for package building anyway.

Nov 26 '21 12:11 oerdnj

Nope, it failed again

I see, too bad.

tests are run all at once in parallel

Oh, okay, I have never tried this and do not think it is a good idea. It definitely does not represent any actual use-case, anyway. Moreover, it really makes it almost impossible to set the timeouts correctly, if the scheduler decides to not execute a thread with a server that is meant to send something, the receiver will always get a timeout. The solution, I believe, is to run the tests for a specific number of jobs that will not exceed the number of concurrent threads the system is able to execute.

Nov 26 '21 12:11 michalvasko

Moreover, it really makes it almost impossible to set the timeouts correctly, if the scheduler decides to not execute a thread with a server that is meant to send something, the receiver will always get a timeout.

Yes, but that's a case for any tests relying on a fixed timeout because the hardware can be busy also with something else. You need dedicated CPU time for executing such tests.

Testing is hard...

Nov 26 '21 12:11 oerdnj

I agree, but I still believe you will marginally lessen the chance of the timeout occurring if you stick to what I wrote because if you execute more threads that the system is able to run in parallel, some threads must be interrupted and the gamble is only whether it will be at a crucial moment or not.

But if the system has enough resources for running all the threads, it should not even interrupt any thread while it has some work to do so the aforementioned problem may not even have a chance to occur.

Nov 26 '21 12:11 michalvasko

I think that it's reasonable from the build system perspective to assume that if you run with N parallel threads, your system should really have "something close to N" cores available. This parallel testing is, anyway, something that you need to explicitly ask for. So if you've got a 64-way SMP system, use -j64, and then end up heavily over-subscribed because you're building other projects in parallel, I think that would be an user error. Just run with a smaller -j?

Nov 26 '21 12:11 jktjkt

Folks, this is becoming a purely academic debate. The machine has 24 cores and 17 unit (cmocka) tests are being run in parallel and the build system has no way of knowing how loaded the system will be at the time the tests will be run. Moreover, with virtualization, you don't have any guarantees about available CPU time at all.

Been there, done that, been burned by that several times - trying to trash threaded networking stack built on top of libuv is especially prone to errors from scheduler when ran from the CI.

We ended up disabling the most scheduling-sensitive unit tests from the CI and those are run only by hand on the developers' machines. The CI guests (and other types of build machines) are usually either unused or overcommited :).

Nov 26 '21 12:11 oerdnj

Note that except headers and format tests, all others always include at least 2 processes - netopeer2-server and the test itself - and even the test sometimes has more than one thread so you end up with 30+ individual threads. I guess my point is mostly that with -j4 on github actions (CI) it seems to run quite well. And even though there were timeout-related problems I had to solve, they were never related to concurrency but rather to the slow VM.

Nov 26 '21 13:11 michalvasko