
Refactor threading and other concurrency support

Open · qkoziol opened this issue 1 year ago • 95 comments

Complete overhaul of the concurrency-related aspects of the library (threading, atomics, locking, etc.), adding private routines in the H5TS package to allow internal algorithms to use all of these capabilities.

Adds a configure option (--enable-threads / HDF5_ENABLE_THREADS), enabled by default, to control whether threading is enabled within the library.

Now supports C11, pthreads, and Windows threading for all H5TS capabilities, except the recursive readers/writers lock, which is not supported on Windows (because Windows threads don't provide a callback on thread-local variable deletion).

Removes the mercury threading code.

qkoziol avatar May 09 '24 16:05 qkoziol

My understanding from reading the Win32 docs and Stack Overflow is that if we switch to the Fls* (fiber local storage) functions instead of Tls*, we can use the destructor callback to free the local storage, and it will work fine with either fibers or threads as long as the calling application doesn't mix the two. Though I suppose if we create threads internally with H5TS_thread_create, this might not work if the application uses fibers (I'm not certain on this).

fortnern avatar May 20 '24 22:05 fortnern
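For reference, a minimal sketch of the Win32 difference being discussed: TlsAlloc() has no destructor hook, while FlsAlloc() takes a callback that the OS invokes on fiber deletion and on thread exit, so per-thread storage can be freed automatically. The helper names here are hypothetical, not H5TS code.

```c
/* Sketch only: Fls* vs Tls* on Windows.  TlsAlloc() provides no cleanup
 * callback; FlsAlloc() does, which is the property being discussed above.
 */
#include <windows.h>
#include <stdlib.h>

static DWORD fls_index = FLS_OUT_OF_INDEXES;

/* Called automatically on fiber deletion / thread exit for non-NULL slots */
static VOID WINAPI free_local_storage(PVOID data)
{
    free(data);
}

static BOOL setup_local_storage(void)
{
    /* The TLS equivalent would be: tls_index = TlsAlloc();  (no callback possible) */
    fls_index = FlsAlloc(free_local_storage);
    return (fls_index != FLS_OUT_OF_INDEXES);
}

static BOOL set_local_value(size_t size)
{
    void *value = malloc(size);    /* hypothetical per-thread state */
    if (value == NULL)
        return FALSE;
    return FlsSetValue(fls_index, value);
}
```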

Alternatively, I suppose we could keep a global list of TLS values to free and free them in the atexit callback. It wouldn't be hard to make the list lock-free, since only additions need to be thread-safe. Of course, this means memory growth if we end up creating and destroying lots of mutexes during runtime (do we anticipate doing that, or will each mutex live for the life of the program?)

fortnern avatar May 20 '24 22:05 fortnern
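A rough sketch of that alternative, assuming C11 atomics (all names are made up for illustration): additions push onto the list head with a compare-and-swap loop, and the whole list is walked once in an atexit() handler. As noted above, nodes are never removed during runtime, which is where the memory growth would come from.

```c
/* Sketch of a grow-only, lock-free list of per-thread allocations that is
 * drained at program exit.  Only additions need to be thread-safe.
 */
#include <stdatomic.h>
#include <stdlib.h>

typedef struct tls_node {
    void            *value;   /* the per-thread allocation to free at exit */
    struct tls_node *next;
} tls_node_t;

static _Atomic(tls_node_t *) tls_list_head = NULL;

/* Lock-free push: retry the CAS until our node is installed at the head */
static int tls_list_push(void *value)
{
    tls_node_t *node = malloc(sizeof(*node));
    if (node == NULL)
        return -1;
    node->value = value;
    node->next  = atomic_load(&tls_list_head);
    while (!atomic_compare_exchange_weak(&tls_list_head, &node->next, node))
        ;   /* on failure, 'next' was updated to the current head; retry */
    return 0;
}

/* Registered with atexit(); runs single-threaded at program shutdown */
static void tls_list_free_all(void)
{
    tls_node_t *node = atomic_load(&tls_list_head);
    while (node != NULL) {
        tls_node_t *next = node->next;
        free(node->value);
        free(node);
        node = next;
    }
}
```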

My understanding is that fibers are like threads but with different scheduling semantics.

fortnern avatar May 20 '24 22:05 fortnern

My understanding from reading the Win32 docs and Stack Overflow is that if we switch to the Fls* (fiber local storage) functions instead of Tls*, we can use the destructor callback to free the local storage, and it will work fine with either fibers or threads as long as the calling application doesn't mix the two. Though I suppose if we create threads internally with H5TS_thread_create, this won't work if the application uses fibers.

Cool - I wasn't aware of those. I'll take a look tonight and see if they will work.

qkoziol avatar May 20 '24 23:05 qkoziol

https://devblogs.microsoft.com/oldnewthing/20191011-00/?p=102989

fortnern avatar May 20 '24 23:05 fortnern

https://devblogs.microsoft.com/oldnewthing/20191011-00/?p=102989

I'm reading https://learn.microsoft.com/en-us/windows/win32/procthread/fibers and have some concerns about using fiber local storage.

What happens (in the probably rare case) when a Windows application that uses multiple fibers within a thread calls into the library from different fibers? (e.g., the example in the MS fiber docs that performs I/O operations from multiple fibers.) It seems like we would allocate a separate fiber-local value for each fiber in the thread.

Having fiber-local storage per fiber, but mutexes, condition variables, etc. per thread, looks like a recipe for mismatches and trouble. Hypothetically, one fiber could enter the library, acquire the API lock, and make a callback that switches to another fiber; that fiber would look like another thread to the library and would not obey the reentrancy conditions I outline in the RFC I've circulated. Probably extremely rare, but definitely possible.

In addition, the thread has to be converted to a fiber first, via ConvertThreadToFiber. This seems to be a requirement: https://stackoverflow.com/questions/19613016/fls-vs-tls-can-i-use-fiber-local-storage-in-place-of-tls . I don't think it's a good idea to do that within the library.

I'm inclined to not worry about the limit on thread local storage for Windows, since the Visual Studio compiler has supported C11 threads for several years and the limit should go away whenever we switch to requiring C11 compliant compilers.

qkoziol avatar May 21 '24 02:05 qkoziol
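For context on the C11 point: the C11 thread-specific storage API does provide the destructor hook that Win32 TlsAlloc() lacks, which is why the limitation should go away once <threads.h> can be required. A minimal sketch (identifiers are illustrative, not actual H5TS names):

```c
/* Sketch only: C11 thread-specific storage with a per-thread destructor. */
#include <threads.h>
#include <stdlib.h>

static tss_t lock_count_key;

static void lock_count_destroy(void *value)
{
    free(value);    /* invoked automatically when each thread exits */
}

static int lock_count_init(void)
{
    /* The second argument is the per-thread destructor callback */
    return (tss_create(&lock_count_key, lock_count_destroy) == thrd_success) ? 0 : -1;
}

static int lock_count_set(unsigned count)
{
    unsigned *value = malloc(sizeof(*value));   /* hypothetical per-thread state */
    if (value == NULL)
        return -1;
    *value = count;
    return (tss_set(lock_count_key, value) == thrd_success) ? 0 : -1;
}
```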

In addition, the thread has to be converted to a fiber first, via ConvertThreadToFiber. This seems to be a requirement: https://stackoverflow.com/questions/19613016/fls-vs-tls-can-i-use-fiber-local-storage-in-place-of-tls . I don't think it's a good idea to do that within the library.

The "top" answer says you actually don't need to convert the thread to a fiber, regardless of what the docs say. Also, it's worth noting that the reply that says you do need to convert it to a fiber is from Raymond Chen, who I assume is the same person who wrote the blog post I mentioned. You'd think he would've mentioned that requirement in the blog post. The SO reply is from 2013 and the blog post is 2019, maybe he realized it wasn't actually necessary in the interim?

At any rate, I didn't catch that the Windows implementation will work if C11 atomics/threading are supported, so it's probably fine as is. We'll just want to add a configure-time failure if thread concurrency is requested and neither C11 nor pthreads is supported (when thread concurrency is added, assuming it requires the recursive R/W lock).

fortnern avatar May 21 '24 14:05 fortnern
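A sketch of what that configure-time guard might look like in CMake. The option and detection variable names here are placeholders, not HDF5's actual ones:

```cmake
# Placeholder names throughout; the point is to fail configuration early
# when thread concurrency is requested but no usable threading model exists.
if (HDF5_ENABLE_THREAD_CONCURRENCY AND NOT HAVE_C11_THREADS AND NOT HAVE_PTHREADS)
  message (FATAL_ERROR
           "Thread concurrency requires C11 threads or pthreads for the "
           "recursive R/W lock; neither was found on this system.")
endif ()
```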

In addition, the thread has to be converted to a fiber first, via ConvertThreadToFiber. This seems to be a requirement: https://stackoverflow.com/questions/19613016/fls-vs-tls-can-i-use-fiber-local-storage-in-place-of-tls . I don't think it's a good idea to do that within the library.

The "top" answer says you actually don't need to convert the thread to a fiber, regardless of what the docs say. Also, it's worth noting that the reply that says you do need to convert it to a fiber is from Raymond Chen, who I assume is the same person who wrote the blog post I mentioned. You'd think he would've mentioned that requirement in the blog post. The SO reply is from 2013 and the blog post is 2019, maybe he realized it wasn't actually necessary in the interim?

Yes, I saw the discussion, but it didn't seem like a good idea to depend on undocumented / unsupported behavior.

At any rate, I didn't catch that the Windows implementation will work if C11 atomics/threading are supported, so it's probably fine as is. We'll just want to add a configure-time failure if thread concurrency is requested and neither C11 nor pthreads is supported (when thread concurrency is added, assuming it requires the recursive R/W lock).

I currently believe that the recursive R/W lock is not going to solve the problems we will have, so I'm not worried about it being unavailable on Windows. If the recursive R/W lock turns out to be necessary, I'll either add a configure failure or work out the thread-local storage for Windows in some way.

qkoziol avatar May 21 '24 14:05 qkoziol

Still debugging, but I can't seem to enable subfiling with these changes. CMake always fails when trying to find MPI_Comm_split_type from MPI. After reverting the patch, I can enable the VFD again.

jhendersonHDF avatar May 24 '24 17:05 jhendersonHDF

Still debugging, but I can't seem to enable subfiling with these changes. CMake always fails when trying to find MPI_Comm_split_type from MPI. After reverting the patch, I can enable the VFD again.

OK, thanks for checking - subfiling is working for me with the autotools builds.

qkoziol avatar May 24 '24 18:05 qkoziol

Also, just a thought but it might be nice if the threading tests all printed out what threading model is being used for debugging purposes.

jhendersonHDF avatar May 24 '24 18:05 jhendersonHDF

Also, just a thought but it might be nice if the threading tests all printed out what threading model is being used for debugging purposes.

That's already in the configure output, but no harm in adding it here too.

qkoziol avatar May 24 '24 18:05 qkoziol

Also, just a thought but it might be nice if the threading tests all printed out what threading model is being used for debugging purposes.

That's already in the configure output, but no harm in adding it here too.

True, I wouldn't necessarily worry about it for this PR but it would be convenient to know right at a glance.

jhendersonHDF avatar May 24 '24 19:05 jhendersonHDF

Also, just a thought but it might be nice if the threading tests all printed out what threading model is being used for debugging purposes.

That's already in the configure output, but no harm in adding it here too.

True, I wouldn't necessarily worry about it for this PR but it would be convenient to know right at a glance.

I added some info to the output for the ttsafe test, although it's now a little non-standard compared to the rest of the tests.

qkoziol avatar May 25 '24 00:05 qkoziol
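Presumably something along these lines; the feature macros below are placeholders for whatever the build actually defines, not the real HDF5 symbols:

```c
/* Illustrative only: report which threading model the test was built with. */
#include <stdio.h>

static void print_threading_model(void)
{
#if defined(USE_C11_THREADS)
    printf("Threading model: C11 threads\n");
#elif defined(USE_PTHREADS)
    printf("Threading model: pthreads\n");
#elif defined(USE_WIN32_THREADS)
    printf("Threading model: Windows threads\n");
#else
    printf("Threading model: none (threads disabled)\n");
#endif
}
```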

Still debugging, but I can't seem to enable subfiling with these changes. CMake always fails when trying to find MPI_Comm_split_type from MPI. After reverting the patch, I can enable the VFD again.

OK, thanks for checking - subfiling is working for me with the autotools builds.

@jhendersonHDF - I just noticed a place in the subfiling CMake configuration that I missed earlier. Can you please try again now?

qkoziol avatar May 28 '24 23:05 qkoziol

@jhendersonHDF - I just noticed a place in the subfiling CMake configuration that I missed earlier. Can you please try again now?

Still running into the same issue, but I believe I have a fix that just involves rearranging some lines in the CMake code. Will get a patch to you later in the afternoon.

jhendersonHDF avatar May 29 '24 16:05 jhendersonHDF
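For anyone hitting the same thing, the kind of rearrangement being described is probably along these lines (names are illustrative, not the actual HDF5 CMake code): the MPI include/link settings have to be populated before the symbol probe runs, or the probe compiles without MPI and reports MPI_Comm_split_type as missing.

```cmake
# Illustrative sketch of the ordering dependency, not the real HDF5 CMake code.
find_package (MPI REQUIRED)

include (CheckSymbolExists)
set (CMAKE_REQUIRED_INCLUDES  ${MPI_C_INCLUDE_DIRS})
set (CMAKE_REQUIRED_LIBRARIES ${MPI_C_LIBRARIES})

# If this check runs before the set() calls above, the test program is built
# without MPI and the symbol is reported as missing even when it exists.
check_symbol_exists (MPI_Comm_split_type "mpi.h" HAVE_MPI_COMM_SPLIT_TYPE)
```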

@jhendersonHDF - I just noticed a place in the subfiling CMake configuration that I missed earlier. Can you please try again now?

Still running into the same issue, but I believe I have a fix that just involves rearranging some lines in the CMake code. Will get a patch to you later in the afternoon.

Cool, thanks!

qkoziol avatar May 29 '24 16:05 qkoziol

Frontier subfiling performance results: the 64-node case was smaller, while the 128- and 256-node cases were larger but the same size as each other. Each test was run a minimum of 5 times. I also encountered a segfault in one of the runs at 256 nodes using the PR version.

PR.pdf

brtnfld avatar Jun 03 '24 22:06 brtnfld

Needs to address performance regression.

Thanks for running these tests, I'll start looking into the performance tomorrow.

qkoziol avatar Jun 04 '24 01:06 qkoziol

@brtnfld - Before I get too far down the path to profiling, etc., could you please do a performance test of subfiling on this branch: https://github.com/qkoziol/hdf5/tree/non_rec_rwlock ? I've updated the thread pool to improve latency and I'm moderately hopeful that performance also improved.

qkoziol avatar Jun 04 '24 14:06 qkoziol

@brtnfld - Before I get too far down the path to profiling, etc., could you please do a performance test of subfiling on this branch: https://github.com/qkoziol/hdf5/tree/non_rec_rwlock ? I've updated the thread pool to improve latency and I'm moderately hopeful that performance also improved.

No problem, I can rerun it once Frontier returns on Thursday.

brtnfld avatar Jun 04 '24 14:06 brtnfld

@brtnfld - Before I get too far down the path to profiling, etc., could you please do a performance test of subfiling on this branch: https://github.com/qkoziol/hdf5/tree/non_rec_rwlock ? I've updated the thread pool to improve latency and I'm moderately hopeful that performance also improved.

No problem, I can rerun it once Frontier returns on Thursday.

Super, thanks!

qkoziol avatar Jun 04 '24 15:06 qkoziol

As an extra point of information, just running the subfiling regression test locally on a Samsung SSD gives me these results:

time mpirun -np 24 --oversubscribe t_subfiling_vfd

Before changes:

real    1m43.879s
user    14m13.310s
sys    25m37.165s

After changes:

real    9m46.380s
user    77m45.929s
sys    146m25.289s

I'll do some more investigation with subfiling debugging on to get some timings and see if I can't figure out where the slowdown is coming from.

jhendersonHDF avatar Jun 04 '24 15:06 jhendersonHDF

As an extra point of information, just running the subfiling regression test locally on a Samsung SSD gives me these results:

time mpirun -np 24 --oversubscribe t_subfiling_vfd

Before changes:

real    1m43.879s
user    14m13.310s
sys    25m37.165s

After changes:

real    9m46.380s
user    77m45.929s
sys    146m25.289s

I'll do some more investigation with subfiling debugging on to get some timings and see if I can't figure out where the slowdown is coming from.

Interesting, I wonder what the issue is? Thanks for looking into it!

qkoziol avatar Jun 04 '24 16:06 qkoziol

Reran with non_rec_rwlock for 256, not much of a difference.

PR2.pdf

brtnfld avatar Jun 05 '24 14:06 brtnfld

Reran with non_rec_rwlock for 256, not much of a difference.

PR2.pdf

OK, thanks, I'll dig in

qkoziol avatar Jun 05 '24 14:06 qkoziol

Reran with non_rec_rwlock for 256, not much of a difference. PR2.pdf

OK, thanks, I'll dig in

Worth noting that the new thread pool will likely be performing significantly more mallocs/frees than the mercury implementation, since the new one allocates an H5TS_pool_task_t task structure each time a task is added to the pool. With that extra layer of indirection, there may also be some cache locality issues. Not sure that could be responsible for such a gap in performance, but it's the first thing that sticks out to me.

jhendersonHDF avatar Jun 05 '24 15:06 jhendersonHDF
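To make the concern concrete, this is roughly the pattern being described (a sketch, not the actual H5TS code): each submitted task costs a malloc on the submit side and a free on the worker side, plus a pointer chase to reach the callback.

```c
/* Sketch of a per-task heap allocation in a thread pool; names are invented. */
#include <stdlib.h>

typedef struct pool_task {
    void (*func)(void *ctx);     /* work callback */
    void  *ctx;                  /* callback argument */
    struct pool_task *next;      /* queue link */
} pool_task_t;

/* Submit path: one heap allocation per task added (queue locking omitted) */
static pool_task_t *pool_task_create(void (*func)(void *), void *ctx)
{
    pool_task_t *task = malloc(sizeof(*task));
    if (task != NULL) {
        task->func = func;
        task->ctx  = ctx;
        task->next = NULL;
    }
    return task;
}

/* Worker side: run the callback, then release the node */
static void pool_task_run(pool_task_t *task)
{
    task->func(task->ctx);
    free(task);                  /* the matching free, once per task */
}
```

A free list of task nodes or caller-owned work descriptors are the usual ways to avoid that per-task cost; the profiling mentioned below suggests the allocations themselves aren't the bottleneck here.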

Reran with non_rec_rwlock for 256, not much of a difference. PR2.pdf

OK, thanks, I'll dig in

Worth noting that the new thread pool will likely be performing significantly more mallocs/frees than the mercury implementation, since the new one allocates an H5TS_pool_task_t task structure each time a task is added to the pool. With that extra layer of indirection, there may also be some cache locality issues. Not sure that could be responsible for such a gap in performance, but it's the first thing that sticks out to me.

Agree, that's one of the aspects I am going to investigate.

qkoziol avatar Jun 05 '24 16:06 qkoziol

@brtnfld - I just committed some optimizations to the non_rec_rwlock branch's version of the thread pool. Could you please re-test that branch?

qkoziol avatar Jun 06 '24 19:06 qkoziol

Reran with non_rec_rwlock for 256, not much of a difference. PR2.pdf

OK, thanks, I'll dig in

Worth noting that the new thread pool will likely be performing significantly more mallocs/frees than the mercury implementation, since the new one allocates an H5TS_pool_task_t task structure each time a task is added to the pool. With that extra layer of indirection, there may also be some cache locality issues. Not sure that could be responsible for such a gap in performance, but it's the first thing that sticks out to me.

Agree, that's one of the aspects I am going to investigate.

Turns out that the malloc/free overhead is very low (<0.3% of the runtime), so probably not the issue.

qkoziol avatar Jun 06 '24 19:06 qkoziol