OpenBLAS icon indicating copy to clipboard operation
OpenBLAS copied to clipboard

Support arbitrary numbers of threads for memory allocation.

Open oon3m0oo opened this issue 7 years ago • 35 comments

When I originally refactored memory.c to reduce locking, I made the (incorrect) assumption that all threads were managed by OpenBLAS. The recent Issues we've seen (#1735) show that really, any caller can make its own threads and call into OpenBLAS; they don't all come from blas_server. Thus we have to be able to support an arbitrary number of threads that can come in any time. The original implementation (before my changes) dealt with this by having a single allocation table, and everyone had to lock to get access to and update it, which was expensive. Moving to thread-local allocation tables was much faster, but now we have to deal with the fact that thread local storage might not be cleaned up.

This change gives each thread its own local allocation table, and completely does away with the global table. We cleanup allocations using pthreads' key destructor and Win32's DLL_THREAD_DETACH.

This change also removes compiler TLS, which in the end, wasn't really worth it given the issues with the glibc implementation (#1720). The overall performance impact was < 1%, anyway. Removing it also simplifies the code.

oon3m0oo avatar Aug 16 '18 14:08 oon3m0oo

Hmm... I'm not sure why Windows suddenly can't find ::Tls*.

oon3m0oo avatar Aug 16 '18 15:08 oon3m0oo

Checking if this is a problem with the CI - local Linux builds show no errors so far.

martin-frbg avatar Aug 16 '18 18:08 martin-frbg

I wonder if the Windows build issue comes simply from using C++ idiom in a plain C file (which would also imply the problem was in your earlier code as well, just masked by the default use of compiler TLS).

martin-frbg avatar Aug 17 '18 07:08 martin-frbg

Aha, you're probably right. I've removed the ::; let's see what happens.

In the meantime I can reproduce the test failures and am actively debugging.

oon3m0oo avatar Aug 17 '18 08:08 oon3m0oo

Yep, that was it. I've fixed the tests, too. The failures are make timeouts (45 minutes). Is there a way I can restart them?

oon3m0oo avatar Aug 17 '18 12:08 oon3m0oo

Not sure if you as the submitter already have the right to restart individual jobs, or perhaps would need to set up a free account with travis. (you should see a symbol with two concetric circles to the right of the time column in the travis jobs table if you can restart jobs by clicking this). I have now restarted the three jobs in question, but the issue with the timeouts could only be resolved by changing the limit in the .travis.yml file. However, these three jobs used to take around twenty minutes like the others, so "make timeout" here probably means a lockup in the tests. (We will see...)

martin-frbg avatar Aug 17 '18 12:08 martin-frbg

I've tried all three locally and everything was fine... not sure what's happening with travis.

oon3m0oo avatar Aug 17 '18 15:08 oon3m0oo

I cannot reproduce this either. Travis status blog has a few recent entries about jobs that require "sudo: true" intermittently failing or hanging. It is just worrying that these three jobs should fail repeatedly...

martin-frbg avatar Aug 17 '18 15:08 martin-frbg

Yeah, it is. Is there any way to get more verbose output of those jobs?

oon3m0oo avatar Aug 17 '18 15:08 oon3m0oo

I could remove the settings for "quiet make" from the travis.yml, but ironically travis used to abort jobs for creating too much output without it.

martin-frbg avatar Aug 17 '18 15:08 martin-frbg

Trying some things in my fork now. But without "QUIET", travis tends to kill the job as it exceeds 4MB of output (mostly from warnings in the LAPACK part)...

martin-frbg avatar Aug 17 '18 16:08 martin-frbg

Travis also kills the absolutely silent jobs... make -s ? I think it barfs out less than a megabyte now.

brada4 avatar Aug 17 '18 18:08 brada4

Turns out it is the fork safety utest that is hanging.

martin-frbg avatar Aug 18 '18 10:08 martin-frbg

Linux timeout command can do it, no posix equivalent, but particular test is only linux...

brada4 avatar Aug 18 '18 10:08 brada4

@brada4 this does not answer the question why the fork test fails with this change.

martin-frbg avatar Aug 18 '18 10:08 martin-frbg

Just to fix up CI

brada4 avatar Aug 18 '18 13:08 brada4

The hang is reproducible with (a vm running) the same version of Ubuntu locally, so probably related to either glibc 2.15 or gcc 4.6.3 . gdb shows one thread in __libc_thread_freeres() called from clone() via start_thread, with the other thread in pthread_join called from blas_thread_shutdown at line 982 of blas_server.c. Not sure yet why this would happen only after this change, when the only relevant difference appears to be in blas_memory_cleanup that happens after the call to blas_thread_shutdown.

martin-frbg avatar Aug 18 '18 18:08 martin-frbg

helgrind sees a race between calls to blas_memory_alloc from memory.c:1071 (reading with no lock held, called from dgemm, gemm.c:403) and memory.c:1097 (writing with a lock held), and another between get_memory_table at memory.c:489 and pthread_key_create called from memory.c:1031

martin-frbg avatar Aug 18 '18 19:08 martin-frbg

Seems access to the newly introduced local_storage_key variable is racey, but adding another lock to protect it does not help.

martin-frbg avatar Aug 18 '18 22:08 martin-frbg

I believe I fixed the race conditions in the "mem" branch of my fork (at the expense of another mutex lock), but the issue with travis looks like a glibc < 2.23 bug. If I move the CI to Ubuntu xenial, the utest passes. The question now becomes how to handle these older systems: An #if !defined(__GLIBC_PREREQ) || __GLIBC_PREREQ(2,23) for the "new" code is easy enough, but what to fall back to in the #else branch - pre-0.3.1 memory.c, or the 0.3.2 memory.c without compiler TLS and with an increased MAX_ALLOCATING_THREADS ?

martin-frbg avatar Aug 19 '18 20:08 martin-frbg

I've added a tiny bit of locking that ought to resolve things... we'll see. I suppose it depends on what the issue with glibc is. The change in logic here might be enough to avoid the issue altogether (crosses fingers).

oon3m0oo avatar Aug 20 '18 11:08 oon3m0oo

My impression was that it was somehow related to using malloc in the threads, but I do not have the bug ids handy now. We'll see what travis says now, and I'll also give it a spin with the ubuntu vm later when I have time.

martin-frbg avatar Aug 20 '18 11:08 martin-frbg

Does not look good unfortunately, I see local segfaults (at memory.c:1128) even with relatively current systems as well. I must admit I am beginning to think we should revert to the 0.3.0 memory.c for 0.3.3 to give users a stable release to work with, and reintroduce the rewrite when it is stabilized and tested.

martin-frbg avatar Aug 20 '18 12:08 martin-frbg

@martin-frbg what tests are you running?

oon3m0oo avatar Aug 20 '18 13:08 oon3m0oo

Normal build (on opensuse leap 42.2, gcc-4.8.5, glibc 2.22) was enough to trigger a segfault in the sblat2 test with your latest version...

martin-frbg avatar Aug 20 '18 15:08 martin-frbg

All I have is 2.24 but I can't reproduce any crashes. Everything seems to run fine, which makes all of this more complicated. I'm trying to see if there's a good way for me to get something were I can reproduce this, but if you have any ideas, I'd be grateful.

oon3m0oo avatar Aug 20 '18 15:08 oon3m0oo

Virtualbox with a bunch of vm's from osboxes.org or similar ? Also running valgrind --tool=helgrind (or just the default memcheck mode valgrind) on the blas tests and/or the utest binary is often helpful.

martin-frbg avatar Aug 20 '18 16:08 martin-frbg

BTW the sblat2 test crashes with alloc_table=0x0 (making alloc_table[0] an illegal access) in line 1128 of memory.c

martin-frbg avatar Aug 20 '18 17:08 martin-frbg

Seems to me the crucial problem is that your revised code is missing a call to blas_tls_init() at the start of blas_memory_alloc() (or the static int tls_initialize there is not working as intended for whatever reason). Net effect seems to be that get_memory_table gets invoked with an undefined local_storage_key.

martin-frbg avatar Aug 22 '18 07:08 martin-frbg

Actually it gets called from blas_memory_init, from blas_memory_alloc. And somehow that line is gone. I'll add it back in.

oon3m0oo avatar Aug 22 '18 08:08 oon3m0oo