
Possible issue with thread management (Helgrind)

Open fclivaz42 opened this issue 4 months ago • 11 comments

Git commit reference d46bd98396609c83defff2da3227be383935d5ea (V10.1.3)

Describe the bug Running the basic ping/pong bot example through helgrind (valgrind --tool=helgrind) creates a gigantic log of data races and mutex/rwlock issues.

To Reproduce Steps to reproduce the behavior:

  1. Compile the base program shown at https://dpp.dev/firstbot.html, without bot.on_log(). (I personally used the flags -ldpp -std=c++23 -g3)
  2. Run valgrind --tool=helgrind ./a.out
  3. Observe!

Expected behavior Helgrind should not be outputting so many "possible data race", "write lock granted on mutex/rwlock which is currently wr-held by a different thread", and so on.

Screenshots Not applicable; however, I can attach both a log file and the sample code (without the token, heh) that produced this error.

System Details:

  • OS: Arch Linux
  • Discord Client used for testing: N/A

Additional context This seems to happen regardless of the OS; I also tried it on Alpine Linux and could reproduce it there as well. Oddly enough, -fsanitize=thread does not complain at all.

Here's the program that I used to recreate this issue:

#include <dpp/dpp.h>

const std::string BOT_TOKEN = "insert token here";

int main() {
	dpp::cluster bot(BOT_TOKEN);

	bot.on_slashcommand([&bot](const dpp::slashcommand_t& event) {
		if (event.command.get_command_name() == "stop") {
			event.reply("Pong!");
			bot.shutdown(); // shutting the bot down also seems to throw a million issues!
		}
	});

	bot.on_ready([&bot](const dpp::ready_t& event) {
		(void)event;
		if (dpp::run_once<struct register_bot_commands>()) {
			bot.global_command_create(dpp::slashcommand("stop", "stop!", bot.me.id));
		}
	});

	bot.start(dpp::st_wait);
}

And the attached Valgrind log:

helgrind.log

Sorry if this seems a bit hectic? I could not find another mention of helgrind in the issues, so I decided to open one myself.

Have a nice day!

fclivaz42 avatar Aug 14 '25 19:08 fclivaz42

Is this related to an actual stability issue?

It runs fine for me in production with 217,000 servers, and the issues don't seem to be within our remit?

 
==97799== Thread #3: Bug in libpthread: write lock granted on mutex/rwlock which is currently wr-held by a different thread
==97799==    at 0x4868C2A: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==97799==    by 0x4C25E7D: ??? (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4E675A3: execute_native_thread_routine (thread.cc:104)
==97799==    by 0x51E79CA: ??? (in /usr/lib/libc.so.6)
==97799==    by 0x526B873: clone (in /usr/lib/libc.so.6)

braindigitalis avatar Aug 14 '25 20:08 braindigitalis

What do you mean by the issues not seeming to be in your remit? Are you unable to replicate them?

fclivaz42 avatar Aug 14 '25 21:08 fclivaz42

It says it's a bug in libpthread, and whatever it is, it isn't causing any issues in the real world. I have multi-month uptimes without deadlocks or any kind of issues?

braindigitalis avatar Aug 14 '25 22:08 braindigitalis

The reason I ended up opening an issue is that I wrote a small multithreaded program on my end to test whether I would get the same errors, and I could not reproduce them there. Are you getting a similar result on your end (mostly talking about the "possible data race" part)?

Edit: this is what I am referring to:

==97799== Possible data race during write of size 8 at 0x5FD5F30 by thread #1
==97799== Locks held: none
==97799==    at 0x4C1D210: dpp::thread_pool::enqueue(dpp::thread_pool_task) (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4A620A1: dpp::cluster::queue_work(int, std::function<void ()>) (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4C04285: ??? (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4BE40A3: dpp::https_client::handle_buffer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4C1B178: dpp::ssl_connection::on_read(int, dpp::socket_events const&) (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4C3E304: ??? (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4A6FB71: dpp::cluster::start(dpp::start_type) (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4003B03: main (test.cpp:22)
==97799== 
==97799== This conflicts with a previous read of size 8 by thread #2
==97799== Locks held: none
==97799==    at 0x4C25E99: ??? (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4E675A3: execute_native_thread_routine (thread.cc:104)
==97799==    by 0x51E79CA: ??? (in /usr/lib/libc.so.6)
==97799==    by 0x526B873: clone (in /usr/lib/libc.so.6)
==97799==  Address 0x5fd5f30 is 32 bytes inside a block of size 152 alloc'd
==97799==    at 0x4860093: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==97799==    by 0x4A68DE7: dpp::cluster::cluster(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, unsigned int, unsigned int, unsigned int, bool, dpp::cache_policy_t, unsigned int) (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4003AAF: main (test.cpp:6)
==97799==  Block was alloc'd by thread #1

and

==97799== Thread #1 unlocked lock at 0x5FD5F48 currently held by thread #4
==97799==    at 0x4869369: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==97799==    by 0x4C1D02D: dpp::thread_pool::enqueue(dpp::thread_pool_task) (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4A620A1: dpp::cluster::queue_work(int, std::function<void ()>) (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4C04285: ??? (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4BE40A3: dpp::https_client::handle_buffer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4C1B178: dpp::ssl_connection::on_read(int, dpp::socket_events const&) (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4C3E304: ??? (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4A6FB71: dpp::cluster::start(dpp::start_type) (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4003B03: main (test.cpp:22)
==97799==  Lock at 0x5FD5F48 was first observed
==97799==    at 0x4868C2A: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==97799==    by 0x4C25E7D: ??? (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4E675A3: execute_native_thread_routine (thread.cc:104)
==97799==    by 0x51E79CA: ??? (in /usr/lib/libc.so.6)
==97799==    by 0x526B873: clone (in /usr/lib/libc.so.6)
==97799==  Address 0x5fd5f48 is 56 bytes inside a block of size 152 alloc'd
==97799==    at 0x4860093: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==97799==    by 0x4A68DE7: dpp::cluster::cluster(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, unsigned int, unsigned int, unsigned int, bool, dpp::cache_policy_t, unsigned int) (in /usr/lib/libdpp.so.10.1.3)
==97799==    by 0x4003AAF: main (test.cpp:6)
==97799==  Block was alloc'd by thread #1
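
For reference, here is a minimal sketch of the kind of standalone test I mean (illustrative only; not the exact program I ran): a few threads bumping a std::mutex-guarded counter, which helgrind should stay quiet on.

#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
	std::mutex m;
	long counter = 0;

	// Spawn a handful of workers that all contend on the same mutex.
	std::vector<std::thread> workers;
	for (int i = 0; i < 4; ++i) {
		workers.emplace_back([&m, &counter] {
			for (int j = 0; j < 100000; ++j) {
				std::lock_guard<std::mutex> lock(m); // every access to counter is serialised
				++counter;
			}
		});
	}
	for (auto& t : workers) {
		t.join();
	}

	std::cout << counter << '\n'; // expected: 400000
}

Running something like that through valgrind --tool=helgrind should produce no race reports, which is why the amount of output from the DPP example stood out to me.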

fclivaz42 avatar Aug 14 '25 22:08 fclivaz42

You shouldn't run valgrind on a release build; the compiler will insert code that would be bad C++ in general but is fine on the specific OS or architecture it's running on.

Mishura4 avatar Aug 15 '25 13:08 Mishura4

I made this issue primarily because I was worried that Helgrind was detecting something genuinely wrong, but if you say it's fine then we can close this?

Just as a side note, I did compile D++ with -g3 and -O0 before running helgrind on it again, which produced the same results. Is it still a non-issue then?

fclivaz42 avatar Aug 16 '25 14:08 fclivaz42

It's worth inspecting at least, though as @braindigitalis said, he runs very active bots and hasn't had issues for months. With that said, I don't know what he runs them on; x86 is much more lenient than other CPUs with byte-sized race conditions, for example.
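
To give a concrete (and entirely generic, non-DPP) example of the kind of thing I mean: an unsynchronised one-byte flag shared between two threads is a textbook data race that helgrind/TSan will flag, even though in practice it usually seems to behave, particularly on x86 with optimisations off.

#include <chrono>
#include <iostream>
#include <thread>

bool stop = false; // should be std::atomic<bool>; a plain bool makes the accesses below a data race

int main() {
	std::thread worker([] {
		while (!stop) { // racy read of stop; formally undefined behaviour, and the compiler may cache this load
			std::this_thread::sleep_for(std::chrono::milliseconds(1));
		}
		std::cout << "worker observed stop\n";
	});

	std::this_thread::sleep_for(std::chrono::milliseconds(50));
	stop = true; // racy write from the main thread
	worker.join();
}

Whether code like that ever visibly misbehaves depends on the CPU, the compiler, and the optimisation level, which is why "it runs fine in production for months" and "the sanitizer complains" can both be true at the same time.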

Mishura4 avatar Aug 17 '25 01:08 Mishura4

Didn't mean to close this, I'm eepy

Mishura4 avatar Aug 17 '25 01:08 Mishura4

I was going to run my bots on aarch64, so if I run into anything I'll report it :)

fclivaz42 avatar Aug 18 '25 05:08 fclivaz42

Any update on this?

Jaskowicz1 avatar Oct 05 '25 09:10 Jaskowicz1

Any update on this?

Yes! During personal development I've come across the thread sanitizer spitting out errors similar to this one:

==================
WARNING: ThreadSanitizer: data race (pid=56941)
  Write of size 8 at 0x721400025ef8 by thread T11 (mutexes: write M0):
    #0 operator delete(void*, unsigned long) /usr/src/debug/gcc/gcc/libsanitizer/tsan/tsan_new_delete.cpp:150 (libtsan.so.2+0xae867) (BuildId: 409aa50d055f3279712bcbbcbdcc0d76905a791b)
    #1 std::_Function_handler<void (nlohmann::json_abi_v3_11_2::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::json_abi_v3_11_2::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char> > >&, dpp::http_request_completion_t const&), dpp::rest_request_list<dpp::slashcommand>(dpp::cluster*, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, dpp::http_method, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (dpp::confirmation_callback_t const&)>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda(nlohmann::json_abi_v3_11_2::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::json_abi_v3_11_2::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char> > >&, dpp::http_request_completion_t const&)#1}>::_M_manager(std::_Any_data&, std::_Any_data const&, std::_Manager_operation) <null> (libdpp.so.10.1.3+0x1ed4a0) (BuildId: f33365cbda041f01d1f0037780a79a28cc334899)

  Previous read of size 3 at 0x721400025ef8 by thread T5:
    #0 strlen /usr/src/debug/gcc/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:422 (libtsan.so.2+0x60925) (BuildId: 409aa50d055f3279712bcbbcbdcc0d76905a791b)
    #1 <null> <null> (libdpp.so.10.1.3+0x2b035e) (BuildId: f33365cbda041f01d1f0037780a79a28cc334899)

  Mutex M0 (0x723000000258) created at:
    #0 pthread_rwlock_rdlock /usr/src/debug/gcc/gcc/libsanitizer/tsan/tsan_interceptors_posix.cpp:1522 (libtsan.so.2+0x5c83c) (BuildId: 409aa50d055f3279712bcbbcbdcc0d76905a791b)
    #1 <null> <null> (libdpp.so.10.1.3+0x1b1b18) (BuildId: f33365cbda041f01d1f0037780a79a28cc334899)

  Thread T11 'event_loop' (tid=56958, running) created by main thread at:
    #0 pthread_create /usr/src/debug/gcc/gcc/libsanitizer/tsan/tsan_interceptors_posix.cpp:1041 (libtsan.so.2+0x5fb47) (BuildId: 409aa50d055f3279712bcbbcbdcc0d76905a791b)
    #1 __gthread_create(unsigned long*, void* (*)(void*), void*) /usr/src/debug/gcc/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:709 (libstdc++.so.6+0xe56a1) (BuildId: 203e750a7d28e57b5ca8a7435066659e6850bf0c)
    #2 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/thread.cc:172 (libstdc++.so.6+0xe56a1)
    #3 main srcs/main2.cpp:79 (optagon+0xb1d1) (BuildId: 935e88c24fce966d14afaa6202a8afeb402000e9)

  Thread T5 'pool/exec/4' (tid=56952, running) created by main thread at:
    #0 pthread_create /usr/src/debug/gcc/gcc/libsanitizer/tsan/tsan_interceptors_posix.cpp:1041 (libtsan.so.2+0x5fb47) (BuildId: 409aa50d055f3279712bcbbcbdcc0d76905a791b)
    #1 __gthread_create(unsigned long*, void* (*)(void*), void*) /usr/src/debug/gcc/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:709 (libstdc++.so.6+0xe56a1) (BuildId: 203e750a7d28e57b5ca8a7435066659e6850bf0c)
    #2 std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/thread.cc:172 (libstdc++.so.6+0xe56a1)
    #3 Loenn::Loenn(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, dpp::snowflake const&) srcs/Loenn/Loenn.cpp:37 (optagon+0x1a533) (BuildId: 935e88c24fce966d14afaa6202a8afeb402000e9)
    #4 main srcs/main2.cpp:78 (optagon+0xb1b2) (BuildId: 935e88c24fce966d14afaa6202a8afeb402000e9)

SUMMARY: ThreadSanitizer: data race (/usr/lib/libdpp.so.10.1.3+0x1ed4a0) (BuildId: f33365cbda041f01d1f0037780a79a28cc334899) in std::_Function_handler<void (nlohmann::json_abi_v3_11_2::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::json_abi_v3_11_2::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char> > >&, dpp::http_request_completion_t const&), dpp::rest_request_list<dpp::slashcommand>(dpp::cluster*, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, dpp::http_method, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (dpp::confirmation_callback_t const&)>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::{lambda(nlohmann::json_abi_v3_11_2::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, unsigned long, double, std::allocator, nlohmann::json_abi_v3_11_2::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char> > >&, dpp::http_request_completion_t const&)#1}>::_M_manager(std::_Any_data&, std::_Any_data const&, std::_Manager_operation)
==================

It mostly seems to happen right after the shards are started. I really can't provide much clarification on this, as sometimes it happens and sometimes it just... doesn't.

fclivaz42 avatar Oct 21 '25 07:10 fclivaz42