dlt-daemon icon indicating copy to clipboard operation
dlt-daemon copied to clipboard

dlt_init() and dlt_free() functions are not thread safe

Open marcin-sochacki opened this issue 2 years ago • 14 comments

The unit test t_dlt_user_shutdown_while_init_is_running from gtest_dlt_user.cpp may fail if executed long enough (for example 60 seconds under Ubuntu) and if dlt-daemon was built with DLT_IPC=UNIX_SOCKET.

In the use case dlt_init() and dlt_free() are called simultaneously for 5 seconds. Unfortunately due to thread safety issues in dlt_user.c code, resources (in this case sockets) get allocated and are never freed. It is only a matter of execution time to reach the OS limits, which eventually lead to the test case failure.

The issue was observed with v2.18.9 and with v2.18.10.

marcin-sochacki avatar Aug 17 '23 11:08 marcin-sochacki

Hello, can it be also the root cause of this s390x test failure? https://buildd.debian.org/status/fetch.php?pkg=dlt-daemon&arch=s390x&ver=2.18.10-1&stamp=1692269034&raw=0

LocutusOfBorg avatar Aug 18 '23 06:08 LocutusOfBorg

Hello, can it be also the root cause of this s390x test failure? https://buildd.debian.org/status/fetch.php?pkg=dlt-daemon&arch=s390x&ver=2.18.10-1&stamp=1692269034&raw=0

No - it is a different issue.

marcin-sochacki avatar Aug 18 '23 08:08 marcin-sochacki

Hello @marcin-sochacki ,

i could not observe any leaks with:

cmake .. -DCMAKE_BUILD_TYPE=Debug -DWITH_DLT_UNIT_TESTS=ON -DDLT_IPC=UNIX_SOCKET

make -j16

./src/daemon/dlt-daemon &

valgrind ./tests/gtest_dlt_user --leak-check=full --gtest_filter=t_dlt_user_shutdown_while_init_is_running.normal

Can you please provide exact steps to reproduce and the log file?

michael-methner avatar Aug 18 '23 09:08 michael-methner

I edited tests/gtest_dlt_user.cpp In line: 5315

I replaced: const auto max_runtime = std::chrono::seconds(5); with const auto max_runtime = std::chrono::seconds(60);

Then I compiled dlt-daemon project, but I didn't start dlt-daemon, only gtest_dlt_user.

If dlt-daemon is running, t_dlt_user_shutdown_while_init_is_running passes OK. If not - the test case fails. Valgrind can only detect memory errors and will not help to detect situation that too many sockets were created. You can intrument the code to print socket fd in dlt_initialize_socket_connection()


    int sockfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
    printf("sockfd=%d\n",sockfd);
 

If you redirect the error stream to /dev/null and run the test case: ./tests/gtest_dlt_user --gtest_filter=t_dlt_user_shutdown_while_init_is_running.normal 2>/dev/null

At one point of time you will see sockfd=-1, which indicates that the OS was not able to create more sockets.

sockfd=1023
sockfd=1023
sockfd=1023
sockfd=-1
/home/msochacki/Projects/dlt-daemon/tests/gtest_dlt_user.cpp:5335: Failure
Expected equality of these values:
  last_free
    Which is: -1
  DLT_RETURN_OK
    Which is: 0
[  FAILED  ] t_dlt_user_shutdown_while_init_is_running.normal (60000 ms)
[----------] 1 test from t_dlt_user_shutdown_while_init_is_running (60000 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (60000 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] t_dlt_user_shutdown_while_init_is_running.normal

 1 FAILED TEST

marcin-sochacki avatar Aug 18 '23 10:08 marcin-sochacki

If you say that running gtest_dlt_user without dlt-daemon is an invalid test case - it is OK for me :)

marcin-sochacki avatar Aug 18 '23 11:08 marcin-sochacki

Hello @alexmohr , I believe the feature for thread safe was implemented by you: https://github.com/COVESA/dlt-daemon/commit/2224cddf140c060d92455ad3ee585e3abfc38eb2

Could you kindly have a look in this issue? Thank you so much

minminlittleshrimp avatar Aug 18 '23 14:08 minminlittleshrimp

@minminlittleshrimp I don't think the issue is related to thread safety. But if there is no daemon running as @marcin-sochacki stated, we do leak file descriptors if logging mode is file. There might be other paths as well that leak file descriptors.

DltReturnValue dlt_init(void)
{
...
 /* Check logging mode and internal log file is opened or not*/
    if (logging_mode == DLT_LOG_TO_FILE && logging_handle == NULL) {
        dlt_log_init(logging_mode);
    }

    /* Initialize common part of dlt_init()/dlt_init_file() */
    if (dlt_init_common() == DLT_RETURN_ERROR) {
        dlt_user_init_state = INIT_UNITIALIZED;
        ////////////////////
        // --------> if logging mode is file, the FD we opened in dlt_log_init is never closed
        ////////////////////
        return DLT_RETURN_ERROR;
    }
...
}

alexmohr avatar Aug 18 '23 15:08 alexmohr

I did some debugging and observed that the function dlt_initialize_socket_connection() is called in the context of the main thread from dlt_init() and in the context of the housekeeper thread from dlt_user_log_reattach_to_daemon()

marcin-sochacki avatar Aug 18 '23 15:08 marcin-sochacki

Hello @lvklevankhanh Could you kindly look into this issue, it's been 2 months now? Thank you. DLT must be thread-safe implementation, any idea from contributors? Regards

minminlittleshrimp avatar Nov 19 '23 06:11 minminlittleshrimp

Hello @Bichdao021195 This bug I think related to our refactoring locking mechanism in DLT. We already refactor by replacing signaling with locking mechanism, could you check again if the issue fixed yet? Regards

minminlittleshrimp avatar Feb 05 '24 03:02 minminlittleshrimp

It looks, that the test case t_dlt_user_shutdown_while_init_is_running still fails, if you run this test longer (for example for 15 seconds) and if you build daemon with UNIX SOCKETS as IPC.

Edit tests/gtest_dlt_user.cpp const auto max_runtime = std::chrono::seconds(15);

Build dlt-daemon and execute tests:

mkdir build
cd build
cmake -DWITH_DLT_UNIT_TESTS=ON -DDLT_IPC=UNIX_SOCKET ..
make -j16
make test


2/15 Test  #2: gtest_dlt_user ............................***Failed   16.01 sec

I run the tests on bf947b30b96d53b2944106359e2224e2ea419074 from master branch on Ubuntu 22.04

marcin-sochacki avatar Mar 13 '24 09:03 marcin-sochacki

Hello @marcin-sochacki Any update on this issue. I can reproduce the case, but just occasionally, with longer period (16 seconds). Still I have no clue how to handle this test case correctly.

minminlittleshrimp avatar May 08 '24 09:05 minminlittleshrimp

Hello @marcin-sochacki Any update on this issue. I can reproduce the case, but just occasionally, with longer period (16 seconds). Still I have no clue how to handle this test case correctly.

I think the test case is correct, but the tested code isn't. There might be a user app startup/shutdown redesign required. You may want to implement a thread safe state machine and properly handle dlt_init() and dlt_free() in each state.

I understand that such changes always introduce a risk of introducing new issues, so you have to decide if you can live with this...

marcin-sochacki avatar May 08 '24 10:05 marcin-sochacki

was the leaking fd when no connection is made to dlt-daemon (https://github.com/COVESA/dlt-daemon/issues/522#issuecomment-1684068925) fixed already? If not we probably should fix that first and see if this resolves the issue. Otherwise we can still re-work the thread safety issue. Ignoring this isn't a good idea as libdlt should be thread safe.

alexmohr avatar May 08 '24 10:05 alexmohr