Possible issue with thread management (Helgrind)
Git commit reference d46bd98396609c83defff2da3227be383935d5ea (V10.1.3)
Describe the bug Running the basic ping/pong bot example through Helgrind (valgrind --tool=helgrind) produces a very large log of data races and mutex/rwlock issues.
To Reproduce Steps to reproduce the behavior:
- Compile the base program shown at https://dpp.dev/firstbot.html, without bot.on_log(). (I personally used the flags -ldpp -std=c++23 -g3.)
- Run valgrind --tool=helgrind ./a.out
- Observe!
Expected behavior Helgrind should not be outputting so many "possible data race", "write lock granted on mutex/rwlock which is currently wr-held by a different thread", and so on.
Screenshots Not applicable; however, I can attach both a log file and the sample code (without the token, heh) that produced this error.
System Details:
- OS: Arch Linux
- Discord Client used for testing: N/A
Additional context This seems to happen regardless of the OS; I also tried it on Alpine Linux and could reproduce it there as well. Oddly enough, -fsanitize=thread does not complain at all.
Here's the program that I used to recreate this issue:
#include <dpp/dpp.h>

const std::string BOT_TOKEN = "insert token here";

int main() {
    dpp::cluster bot(BOT_TOKEN);

    bot.on_slashcommand([&bot](const dpp::slashcommand_t& event) {
        if (event.command.get_command_name() == "stop") {
            event.reply("Pong!");
            bot.shutdown(); // shutting the bot down also seems to throw a million issues!
        }
    });

    bot.on_ready([&bot](const dpp::ready_t& event) {
        (void)event;
        if (dpp::run_once<struct register_bot_commands>()) {
            bot.global_command_create(dpp::slashcommand("stop", "stop!", bot.me.id));
        }
    });

    bot.start(dpp::st_wait);
}
And the attached Valgrind log:
Sorry if this seems a bit hectic; I could not find another mention of Helgrind in the issues, so I decided to open one myself.
Have a nice day!
Is this related to an actual stability issue?
It runs fine for me in production with 217,000 servers, and the issues don't seem to be in our remit.
==97799== Thread #3: Bug in libpthread: write lock granted on mutex/rwlock which is currently wr-held by a different thread
==97799== at 0x4868C2A: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==97799== by 0x4C25E7D: ??? (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4E675A3: execute_native_thread_routine (thread.cc:104)
==97799== by 0x51E79CA: ??? (in /usr/lib/libc.so.6)
==97799== by 0x526B873: clone (in /usr/lib/libc.so.6)
What do you mean by the issues not seeming to be in your remit? Are you unable to replicate them?
It says it's a bug in libpthread, and whatever it is, it isn't causing any issues in the real world. I have multi-month uptimes without deadlocks or any other kind of issue.
The reason I ended up opening an issue is that I wrote a small multithreaded program of my own to test whether I would get the same errors, and I could not reproduce them there. Are you getting a similar result on your end (mostly talking about the "possible data race" part)? The kind of control test I mean is sketched below.
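(Hypothetical sketch, not my exact program: a few threads incrementing a shared counter under a std::mutex. Since every access is lock-protected, Helgrind should come back clean on something like this, which is why the volume of output from the D++ example stood out to me.)

// Hypothetical control test (not the exact program I used): several threads
// increment a shared counter under a std::mutex. Helgrind should not report
// any data races here, since every access to `counter` is lock-protected.
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::mutex m;
    long counter = 0;

    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([&] {
            for (int j = 0; j < 100000; ++j) {
                std::lock_guard<std::mutex> lock(m); // serialise every access to counter
                ++counter;
            }
        });
    }
    for (auto& t : workers) {
        t.join();
    }

    std::cout << "counter = " << counter << '\n'; // deterministically 400000
    return 0;
}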
Edit: this is what I am referring to:
==97799== Possible data race during write of size 8 at 0x5FD5F30 by thread #1
==97799== Locks held: none
==97799== at 0x4C1D210: dpp::thread_pool::enqueue(dpp::thread_pool_task) (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4A620A1: dpp::cluster::queue_work(int, std::function<void ()>) (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4C04285: ??? (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4BE40A3: dpp::https_client::handle_buffer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4C1B178: dpp::ssl_connection::on_read(int, dpp::socket_events const&) (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4C3E304: ??? (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4A6FB71: dpp::cluster::start(dpp::start_type) (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4003B03: main (test.cpp:22)
==97799==
==97799== This conflicts with a previous read of size 8 by thread #2
==97799== Locks held: none
==97799== at 0x4C25E99: ??? (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4E675A3: execute_native_thread_routine (thread.cc:104)
==97799== by 0x51E79CA: ??? (in /usr/lib/libc.so.6)
==97799== by 0x526B873: clone (in /usr/lib/libc.so.6)
==97799== Address 0x5fd5f30 is 32 bytes inside a block of size 152 alloc'd
==97799== at 0x4860093: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==97799== by 0x4A68DE7: dpp::cluster::cluster(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, unsigned int, unsigned int, unsigned int, bool, dpp::cache_policy_t, unsigned int) (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4003AAF: main (test.cpp:6)
==97799== Block was alloc'd by thread #1
and
==97799== Thread #1 unlocked lock at 0x5FD5F48 currently held by thread #4
==97799== at 0x4869369: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==97799== by 0x4C1D02D: dpp::thread_pool::enqueue(dpp::thread_pool_task) (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4A620A1: dpp::cluster::queue_work(int, std::function<void ()>) (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4C04285: ??? (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4BE40A3: dpp::https_client::handle_buffer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4C1B178: dpp::ssl_connection::on_read(int, dpp::socket_events const&) (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4C3E304: ??? (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4A6FB71: dpp::cluster::start(dpp::start_type) (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4003B03: main (test.cpp:22)
==97799== Lock at 0x5FD5F48 was first observed
==97799== at 0x4868C2A: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==97799== by 0x4C25E7D: ??? (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4E675A3: execute_native_thread_routine (thread.cc:104)
==97799== by 0x51E79CA: ??? (in /usr/lib/libc.so.6)
==97799== by 0x526B873: clone (in /usr/lib/libc.so.6)
==97799== Address 0x5fd5f48 is 56 bytes inside a block of size 152 alloc'd
==97799== at 0x4860093: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==97799== by 0x4A68DE7: dpp::cluster::cluster(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, unsigned int, unsigned int, unsigned int, bool, dpp::cache_policy_t, unsigned int) (in /usr/lib/libdpp.so.10.1.3)
==97799== by 0x4003AAF: main (test.cpp:6)
==97799== Block was alloc'd by thread #1
You shouldn't run Valgrind on a release build; the compiler will insert code that would be bad C++ in general but is fine on the specific OS or architecture it's running on.
I made this issue primarily because I was worried that Helgrind was detecting something genuinely wrong, but if you say it's fine then we can close this?
Just as a side note, I did compile D++ with -g3 and -O0 before running Helgrind on it again, which produced the same results. Is it still a non-issue then?
It's worth inspecting at least, though as @braindigitalis said, he runs very active bots and hasn't had issues for months. That said, I don't know what he runs them on; x86 is much more lenient than other CPUs with byte-sized race conditions, for example (a rough sketch of what I mean is below).
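(Hypothetical illustration, not code from D++: a plain byte-sized flag shared between threads with no synchronisation is a data race in C++ even though it often appears to work on x86's strong memory model. ThreadSanitizer and Helgrind will flag that pattern, and on weaker architectures such as aarch64 it can actually misbehave; wrapping the flag in std::atomic is the usual fix.)

// Hypothetical illustration, not D++ code: a byte-sized flag shared between
// threads. The commented-out non-atomic version is undefined behaviour (a data
// race) even though it often *appears* to work on x86; std::atomic is well defined.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

int main() {
    // std::uint8_t done = 0;             // racy: unsynchronised byte-sized flag (UB)
    std::atomic<std::uint8_t> done{0};    // well-defined replacement

    std::thread worker([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10)); // pretend to do work
        done.store(1, std::memory_order_release);                   // publish completion
    });

    while (done.load(std::memory_order_acquire) == 0) {
        std::this_thread::yield(); // spin until the worker signals completion
    }
    std::cout << "worker finished\n";

    worker.join();
    return 0;
}

With the atomic in place, both sanitizers should stay quiet about this flag; the commented-out plain byte is the sort of pattern that tends to only show up as real breakage once you move off x86.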
Didn't mean to close this, I'm eepy.
I was going to run my bots on aarch64, so if I run into anything I'll report it :)
Any update on this?
Any update on this?
Yes! During personal development I've come across ThreadSanitizer spitting out errors similar to this one:
==================
WARNING: ThreadSanitizer: data race (pid=56941)
Write of size 8 at 0x721400025ef8 by thread T11 (mutexes: write M0):
#0 operator delete(void*, unsigned long) /usr/src/debug/gcc/gcc/libsanitizer/tsan/tsan_new_delete.cpp:150 (libtsan.so.2+0xae867) (BuildId: 409aa50d055f3279712bcbbcbdcc0d76905a791b)
#1 std::_Function_handler<void (nlohmann::json_abi_v3_11_2::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::json_abi_v3_11_2::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char> > >&, dpp::http_request_completion_t const&), dpp::rest_request_list<dpp::slashcommand>(dpp::cluster*, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, dpp::http_method, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (dpp::confirmation_callback_t const&)>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda(nlohmann::json_abi_v3_11_2::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::json_abi_v3_11_2::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char> > >&, dpp::http_request_completion_t const&)#1}>::_M_manager(std::_Any_data&, std::_Any_data const&, std::_Manager_operation) <null> (libdpp.so.10.1.3+0x1ed4a0) (BuildId: f33365cbda041f01d1f0037780a79a28cc334899)
Previous read of size 3 at 0x721400025ef8 by thread T5:
#0 strlen /usr/src/debug/gcc/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:422 (libtsan.so.2+0x60925) (BuildId: 409aa50d055f3279712bcbbcbdcc0d76905a791b)
#1 <null> <null> (libdpp.so.10.1.3+0x2b035e) (BuildId: f33365cbda041f01d1f0037780a79a28cc334899)
Mutex M0 (0x723000000258) created at:
#0 pthread_rwlock_rdlock /usr/src/debug/gcc/gcc/libsanitizer/tsan/tsan_interceptors_posix.cpp:1522 (libtsan.so.2+0x5c83c) (BuildId: 409aa50d055f3279712bcbbcbdcc0d76905a791b)
#1 <null> <null> (libdpp.so.10.1.3+0x1b1b18) (BuildId: f33365cbda041f01d1f0037780a79a28cc334899)
Thread T11 'event_loop' (tid=56958, running) created by main thread at:
#0 pthread_create /usr/src/debug/gcc/gcc/libsanitizer/tsan/tsan_interceptors_posix.cpp:1041 (libtsan.so.2+0x5fb47) (BuildId: 409aa50d055f3279712bcbbcbdcc0d76905a791b)
#1 __gthread_create(unsigned long*, void* (*)(void*), void*) /usr/src/debug/gcc/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:709 (libstdc++.so.6+0xe56a1) (BuildId: 203e750a7d28e57b5ca8a7435066659e6850bf0c)
#2 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/thread.cc:172 (libstdc++.so.6+0xe56a1)
#3 main srcs/main2.cpp:79 (optagon+0xb1d1) (BuildId: 935e88c24fce966d14afaa6202a8afeb402000e9)
Thread T5 'pool/exec/4' (tid=56952, running) created by main thread at:
#0 pthread_create /usr/src/debug/gcc/gcc/libsanitizer/tsan/tsan_interceptors_posix.cpp:1041 (libtsan.so.2+0x5fb47) (BuildId: 409aa50d055f3279712bcbbcbdcc0d76905a791b)
#1 __gthread_create(unsigned long*, void* (*)(void*), void*) /usr/src/debug/gcc/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:709 (libstdc++.so.6+0xe56a1) (BuildId: 203e750a7d28e57b5ca8a7435066659e6850bf0c)
#2 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/thread.cc:172 (libstdc++.so.6+0xe56a1)
#3 Loenn::Loenn(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, dpp::snowflake const&) srcs/Loenn/Loenn.cpp:37 (optagon+0x1a533) (BuildId: 935e88c24fce966d14afaa6202a8afeb402000e9)
#4 main srcs/main2.cpp:78 (optagon+0xb1b2) (BuildId: 935e88c24fce966d14afaa6202a8afeb402000e9)
SUMMARY: ThreadSanitizer: data race (/usr/lib/libdpp.so.10.1.3+0x1ed4a0) (BuildId: f33365cbda041f01d1f0037780a79a28cc334899) in std::_Function_handler<void (nlohmann::json_abi_v3_11_2::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::json_abi_v3_11_2::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char> > >&, dpp::http_request_completion_t const&), dpp::rest_request_list<dpp::slashcommand>(dpp::cluster*, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, dpp::http_method, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (dpp::confirmation_callback_t const&)>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda(nlohmann::json_abi_v3_11_2::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::json_abi_v3_11_2::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char> > >&, dpp::http_request_completion_t const&)#1}>::_M_manager(std::_Any_data&, std::_Any_data const&, std::_Manager_operation)
==================
It mostly seems to happen right after the shards are started. I really can't provide much clarification on this, as sometimes it happens and sometimes it just... doesn't.