Rare memory management failures in FM-index in multi-threading context
Does this problem persist on the current main?
- [x] I have verified the issue on the current main
Is there an existing issue for this?
- [x] I have searched the existing issues
Current Behavior
In rare cases we find failures that look like this:
libs/seqan3/submodules/sdsl-lite/include/sdsl/int_vector.hpp:1820:
sdsl::int_vector<<anonymous> >::reference
sdsl::int_vector<<anonymous> >::operator[](const size_type&)
[with unsigned char t_width = 0;
reference = sdsl::int_vector_reference<sdsl::int_vector<0> >;
size_type = long unsigned int]:
Assertion `idx < this->size()' failed.
Aborted (core dumped)
We are using the FM-index in a multi-threaded context where each thread creates a separate instance of the index. Our understanding is that the global singleton pattern currently does not handle critical parts of the memory management in a thread-safe manner.
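For illustration, here is a minimal, hypothetical sketch (toy code, not the actual SDSL sources) of the failure mode we suspect: a process-wide singleton whose internal bookkeeping is mutated by several threads without synchronization:
#include <cstddef>
#include <map>
#include <thread>

struct memory_tracker // hypothetical stand-in for the SDSL memory manager
{
    std::map<void *, std::size_t> allocations; // shared, unsynchronized state

    static memory_tracker & instance()
    {
        static memory_tracker m; // one instance shared by all threads
        return m;
    }
};

int main()
{
    auto worker = []()
    {
        for (int i = 0; i < 1000; ++i)
        {
            int x{};
            // Concurrent mutation of the shared map is a data race (undefined
            // behavior); in practice it can corrupt internal bookkeeping and
            // only fail much later, e.g. as an out-of-bounds assertion.
            memory_tracker::instance().allocations[&x] = sizeof(x);
            memory_tracker::instance().allocations.erase(&x);
        }
    };

    std::thread t1{worker};
    std::thread t2{worker};
    t1.join();
    t2.join();
}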
Expected Behavior
We would like to use the FM-index in a multi-threaded context.
Steps To Reproduce
We do not have a small test that reproduces the issue. Creating a number of threads that constantly create and destroy FM-indices should trigger the failure.
We have successfully tested the following patch:
- static memory_manager m;
+ static thread_local memory_manager m;
on seqan3/submodules/sdsl-lite/include/sdsl/memory_management.hpp line 776.
Environment
- Operating system: ubuntu:22.04
- SeqAn version: 3.3.0
- Compiler: gcc-12
Anything else?
As the failure is rare, we cannot be sure that our patch is 100% effective, but we have not seen any issues in the tests we have run so far. The tests are successful in the sense that the application compiles and has been run on about 50 cases. We previously saw a failure rate of about 1/500, so we will only be able to confirm that this is working well after a couple of weeks. Therefore, your assessment of the issue and the solution would be valuable.
Hey @behrj,
thanks for reporting this!
I did some digging.
Here's my example code:
#include <cstddef>
#include <vector>

#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/search/fm_index/fm_index.hpp>

int main()
{
    using namespace seqan3::literals;

    static constexpr size_t threads{8};

    std::vector<seqan3::dna4> genome{"ATCGATCGAAGGCTAGCTAGCTAAGGGA"_dna4};

#pragma omp parallel for num_threads(threads)
    for (size_t i = 0; i < threads; ++i)
    {
        // Each thread constructs, queries, and destroys its own FM-index.
        seqan3::fm_index index{genome};
        auto cur = index.cursor();
        cur.extend_right("AAGG"_dna4);
        [[maybe_unused]] size_t count = cur.count();
    }

    return 0;
}
I used the current seqan3 main branch and the SDSL of seqan3 3.3.0.
Compiled with clang-21¹, debug mode with -fsanitize=thread.
I got a data race in sdsl::util::_id_helper().
Before that, I got a data race in sdsl::ram_filebuf, but that one was a side effect of the id helper. I silenced it by adding thread_local to static memory_monitor m;. After fixing the id helper, this isn't needed anymore.
The id helper was fixed in https://github.com/xxsds/sdsl-lite/pull/126 and it's included in seqan3 3.4.0. Basically, the id helper is meant for returning increasing IDs and used for cache/ram_fs/etc. Because the counter wasn't atomic, multiple threads could use the same ID, and subsequently allocate/deallocate/access the same data.
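Conceptually, the fix looks like this (a simplified sketch in the spirit of that PR, not the actual SDSL code):
#include <atomic>
#include <cstdint>

// Before: a plain counter. Two threads can read the same value before either
// writes back the increment, so both receive the same "unique" ID (data race).
uint64_t id_helper_racy()
{
    static uint64_t id{0};
    return ++id;
}

// After: an atomic increment hands out a distinct ID to every caller,
// regardless of how the threads interleave.
uint64_t id_helper_fixed()
{
    static std::atomic<uint64_t> id{0};
    return ++id;
}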
Switching to seqan3 3.4.0 shouldn't break anything, but note that our package management changed a bit (see the release notes, section "Notes for package maintainers").
But it might be easier to first try out the id helper patch and check whether that alone fixes it, without adding thread_local to the memory_manager.
¹ Clang's ThreadSanitizer can handle OpenMP with libarcher:
export OMP_TOOL_LIBRARIES=/usr/lib/llvm-21/lib/libarcher.so
Additionally,
export LD_LIBRARY_PATH="/usr/lib/llvm-21/lib"
may be needed.
Hi @eseiler, thanks a lot for looking into this. I am aware of the issue with the id helper. We have been working on this together before: https://github.com/seqan/seqan3/issues/3258
I currently still maintain that fix as a patch on 3.3.0 in our repo, and it has reduced our failure rate a lot. We plan to switch to 3.4.0 soon so we can stop maintaining this patch. However, the issue I reported today still happens with that patch in place, albeit at a much lower rate than without it.
Having a look at your example code, I think it will probably not trigger the memory management issue: my suspicion is that it primarily happens when one thread makes a request that causes the memory manager to re-allocate while another thread asks for memory that would not cause re-allocation. So my guess is that, to trigger it, the requests would need to vary in size, and some would need to be large enough to require re-allocation.
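For example, a variation of the reproducer above along these lines might have a better chance of triggering it (an untested sketch; the length range and iteration counts are arbitrary guesses):
#include <cstddef>
#include <random>
#include <vector>

#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/search/fm_index/fm_index.hpp>

int main()
{
    using namespace seqan3::literals;

    static constexpr size_t threads{8};

#pragma omp parallel for num_threads(threads)
    for (size_t i = 0; i < threads; ++i)
    {
        std::mt19937_64 rng{i};
        std::uniform_int_distribution<size_t> length_dist{100u, 1'000'000u};

        for (size_t round = 0; round < 50; ++round)
        {
            // Vary the text size: small constructions alongside ones that are
            // large enough to force the memory manager to (re)allocate.
            std::vector<seqan3::dna4> genome(length_dist(rng), 'A'_dna4);
            seqan3::fm_index index{genome};

            auto cur = index.cursor();
            cur.extend_right("AAAA"_dna4);
            [[maybe_unused]] size_t count = cur.count();
        }
    }

    return 0;
}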
Alright, then it is indeed very hard to trigger (also with different sizes).
And with thread_local, the problem goes away (so far)?
You did patch this one, correct? Should be line 838 instead of 776, no?
static memory_manager & the_manager()
{
    static memory_manager m;
    return m;
}
I don't mind adding thread_local there, but I'm not sure why this fixes things. The memory_manager is basically static; memory_monitor at least has some state...
Exactly, this is where the patch goes; in 3.3.0 it is line 776, but you found the right spot. My understanding is that thread_local causes one memory_manager to be created per thread, still following the singleton pattern, but avoiding all the thread-safety code and the performance penalty that might come with it. I don't think other people should be affected by this change, as the index currently doesn't seem to be commonly used in a multi-threaded context; otherwise I would not be the only one running into these issues.
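To illustrate (a standalone toy example, not the SDSL code), with thread_local each thread sees its own instance while the accessor keeps the Meyers-singleton shape:
#include <iostream>
#include <thread>

struct manager // stand-in for sdsl's memory_manager
{
};

manager & the_manager()
{
    thread_local manager m; // one instance per thread instead of one per process
    return m;
}

int main()
{
    auto print_address = []() { std::cout << &the_manager() << '\n'; };

    std::thread t1{print_address};
    std::thread t2{print_address};
    t1.join();
    t2.join(); // the two printed addresses differ: no state is shared across threads
}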
We are about to run more extensive tests on this, so in a couple of days I can tell you more regarding the effectiveness of this fix.