dlt-daemon
dlt-daemon copied to clipboard
dlt_init() and dlt_free() functions are not thread safe
The unit test t_dlt_user_shutdown_while_init_is_running from gtest_dlt_user.cpp may fail if executed long enough (for example 60 seconds under Ubuntu) and if dlt-daemon was built with DLT_IPC=UNIX_SOCKET.
In the use case dlt_init() and dlt_free() are called simultaneously for 5 seconds. Unfortunately due to thread safety issues in dlt_user.c code, resources (in this case sockets) get allocated and are never freed. It is only a matter of execution time to reach the OS limits, which eventually lead to the test case failure.
The issue was observed with v2.18.9 and with v2.18.10.
Hello, can it be also the root cause of this s390x test failure? https://buildd.debian.org/status/fetch.php?pkg=dlt-daemon&arch=s390x&ver=2.18.10-1&stamp=1692269034&raw=0
Hello, can it be also the root cause of this s390x test failure? https://buildd.debian.org/status/fetch.php?pkg=dlt-daemon&arch=s390x&ver=2.18.10-1&stamp=1692269034&raw=0
No - it is a different issue.
Hello @marcin-sochacki ,
i could not observe any leaks with:
cmake .. -DCMAKE_BUILD_TYPE=Debug -DWITH_DLT_UNIT_TESTS=ON -DDLT_IPC=UNIX_SOCKET
make -j16
./src/daemon/dlt-daemon &
valgrind ./tests/gtest_dlt_user --leak-check=full --gtest_filter=t_dlt_user_shutdown_while_init_is_running.normal
Can you please provide exact steps to reproduce and the log file?
I edited tests/gtest_dlt_user.cpp In line: 5315
I replaced:
const auto max_runtime = std::chrono::seconds(5);
with
const auto max_runtime = std::chrono::seconds(60);
Then I compiled dlt-daemon project, but I didn't start dlt-daemon, only gtest_dlt_user.
If dlt-daemon is running, t_dlt_user_shutdown_while_init_is_running passes OK. If not - the test case fails. Valgrind can only detect memory errors and will not help to detect situation that too many sockets were created. You can intrument the code to print socket fd in dlt_initialize_socket_connection()
int sockfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
printf("sockfd=%d\n",sockfd);
If you redirect the error stream to /dev/null and run the test case:
./tests/gtest_dlt_user --gtest_filter=t_dlt_user_shutdown_while_init_is_running.normal 2>/dev/null
At one point of time you will see sockfd=-1, which indicates that the OS was not able to create more sockets.
sockfd=1023
sockfd=1023
sockfd=1023
sockfd=-1
/home/msochacki/Projects/dlt-daemon/tests/gtest_dlt_user.cpp:5335: Failure
Expected equality of these values:
last_free
Which is: -1
DLT_RETURN_OK
Which is: 0
[ FAILED ] t_dlt_user_shutdown_while_init_is_running.normal (60000 ms)
[----------] 1 test from t_dlt_user_shutdown_while_init_is_running (60000 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (60000 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] t_dlt_user_shutdown_while_init_is_running.normal
1 FAILED TEST
If you say that running gtest_dlt_user without dlt-daemon is an invalid test case - it is OK for me :)
Hello @alexmohr , I believe the feature for thread safe was implemented by you: https://github.com/COVESA/dlt-daemon/commit/2224cddf140c060d92455ad3ee585e3abfc38eb2
Could you kindly have a look in this issue? Thank you so much
@minminlittleshrimp I don't think the issue is related to thread safety. But if there is no daemon running as @marcin-sochacki stated, we do leak file descriptors if logging mode is file. There might be other paths as well that leak file descriptors.
DltReturnValue dlt_init(void)
{
...
/* Check logging mode and internal log file is opened or not*/
if (logging_mode == DLT_LOG_TO_FILE && logging_handle == NULL) {
dlt_log_init(logging_mode);
}
/* Initialize common part of dlt_init()/dlt_init_file() */
if (dlt_init_common() == DLT_RETURN_ERROR) {
dlt_user_init_state = INIT_UNITIALIZED;
////////////////////
// --------> if logging mode is file, the FD we opened in dlt_log_init is never closed
////////////////////
return DLT_RETURN_ERROR;
}
...
}
I did some debugging and observed that the function dlt_initialize_socket_connection() is called in the context of the main thread from dlt_init() and in the context of the housekeeper thread from dlt_user_log_reattach_to_daemon()
Hello @lvklevankhanh Could you kindly look into this issue, it's been 2 months now? Thank you. DLT must be thread-safe implementation, any idea from contributors? Regards
Hello @Bichdao021195 This bug I think related to our refactoring locking mechanism in DLT. We already refactor by replacing signaling with locking mechanism, could you check again if the issue fixed yet? Regards
It looks, that the test case t_dlt_user_shutdown_while_init_is_running still fails, if you run this test longer (for example for 15 seconds) and if you build daemon with UNIX SOCKETS as IPC.
Edit tests/gtest_dlt_user.cpp
const auto max_runtime = std::chrono::seconds(15);
Build dlt-daemon and execute tests:
mkdir build
cd build
cmake -DWITH_DLT_UNIT_TESTS=ON -DDLT_IPC=UNIX_SOCKET ..
make -j16
make test
2/15 Test #2: gtest_dlt_user ............................***Failed 16.01 sec
I run the tests on bf947b30b96d53b2944106359e2224e2ea419074 from master branch on Ubuntu 22.04
Hello @marcin-sochacki Any update on this issue. I can reproduce the case, but just occasionally, with longer period (16 seconds). Still I have no clue how to handle this test case correctly.
Hello @marcin-sochacki Any update on this issue. I can reproduce the case, but just occasionally, with longer period (16 seconds). Still I have no clue how to handle this test case correctly.
I think the test case is correct, but the tested code isn't. There might be a user app startup/shutdown redesign required. You may want to implement a thread safe state machine and properly handle dlt_init() and dlt_free() in each state.
I understand that such changes always introduce a risk of introducing new issues, so you have to decide if you can live with this...
was the leaking fd when no connection is made to dlt-daemon (https://github.com/COVESA/dlt-daemon/issues/522#issuecomment-1684068925) fixed already? If not we probably should fix that first and see if this resolves the issue. Otherwise we can still re-work the thread safety issue. Ignoring this isn't a good idea as libdlt should be thread safe.