Refactor threading and other concurrency support
Complete overhaul of the concurrency-related aspects of the library (threading, atomics, locking, etc.), adding private routines in the H5TS package to allow internal algorithms to use all of these capabilities.
Adds a configure option (--enable-threads / HDF5_ENABLE_THREADS), enabled by default, to control whether threading is enabled within the library.
Now supports C11, pthreads, and Windows threading for all H5TS capabilities, except the recursive readers/writers lock, which is not supported on Windows (because Windows threads don't provide a callback on thread-local variable deletion).
Removes the mercury threading code.
My understanding from reading the Win32 docs and Stack Overflow is that if we switch to using the Fls* (fiber local storage) functions instead of Tls*, we'll be able to use the destructor callback to free the local storage, and it will work fine with either fibers or threads, as long as the calling application doesn't mix the two. However, if we create threads internally with H5TS_thread_create, this might not work if the application uses fibers (I'm not certain on this).
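For illustration, here's a minimal sketch of the Fls* pattern described above (this is not HDF5 code, and names like fls_index and fls_local_init are made up): unlike TlsAlloc(), FlsAlloc() accepts a callback that Windows invokes when the owning fiber or thread goes away, which could free the per-thread storage automatically.

```c
#include <windows.h>
#include <stdlib.h>

static DWORD fls_index = FLS_OUT_OF_INDEXES;

/* Invoked by Windows when the owning fiber is deleted or its thread exits */
static VOID WINAPI
fls_free_local_storage(PVOID data)
{
    free(data);
}

static BOOL
fls_local_init(void)
{
    /* TlsAlloc() has no equivalent callback parameter */
    fls_index = FlsAlloc(fls_free_local_storage);
    return (fls_index != FLS_OUT_OF_INDEXES);
}

static BOOL
fls_local_set(void *value)
{
    return FlsSetValue(fls_index, value);
}
```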
Alternatively, I suppose we could keep a global list of TLS values to free, and free them in the atexit callback. It wouldn't be hard to make the list lock-free, since only additions need to be thread-safe. Of course, this means memory growth if we end up creating and destroying lots of mutexes at runtime (do we anticipate doing that, or will each mutex be created for the life of the program?)
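A rough sketch of that alternative, assuming C11 atomics (all names here are hypothetical, not proposed API): each TLS value is pushed onto a global list with a CAS loop, and everything is freed from an atexit() handler once the program is single-threaded again.

```c
#include <stdatomic.h>
#include <stdlib.h>

typedef struct tls_node {
    void            *value;     /* thread-local allocation to free at exit */
    struct tls_node *next;
} tls_node;

static _Atomic(tls_node *) tls_list_head = NULL;

/* Lock-free push: only additions happen concurrently, so a CAS loop on the
 * head pointer is all the synchronization needed. */
static int
tls_list_register(void *value)
{
    tls_node *node = malloc(sizeof(*node));

    if (NULL == node)
        return -1;
    node->value = value;
    node->next  = atomic_load(&tls_list_head);
    while (!atomic_compare_exchange_weak(&tls_list_head, &node->next, node))
        ;   /* on failure, node->next was updated to the current head; retry */
    return 0;
}

/* Registered with atexit(); runs single-threaded at program shutdown */
static void
tls_list_free_all(void)
{
    tls_node *node = atomic_load(&tls_list_head);

    while (node) {
        tls_node *next = node->next;
        free(node->value);
        free(node);
        node = next;
    }
}
```

As noted above, nodes are never reclaimed before exit, so the list grows with every mutex created during the run.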
My understanding is that fibers are like threads, but with cooperative rather than preemptive scheduling.
Cool - I wasn't aware of those. I'll take a look tonight and see if they will work.
https://devblogs.microsoft.com/oldnewthing/20191011-00/?p=102989
I'm reading https://learn.microsoft.com/en-us/windows/win32/procthread/fibers and have some concerns about using fiber local storage.
What happens (in the probably rare case) when a Windows application that uses multiple fibers within a thread calls into the library from different fibers? (e.g., the example in the MS fiber docs that performs I/O operations from multiple fibers) It seems like we would allocate a separate fiber-local value for each fiber within the thread.
Having fiber local storage per fiber, but mutexes, condition variables, etc. per thread, looks like a recipe for mismatch and trouble. Hypothetically, one fiber could enter the library, acquire the API lock, make a callback that switches to another fiber, and then appear to the library as another thread, which would violate the reentrancy conditions I outline in the RFC I've circulated. Probably extremely rare, but definitely possible.
In addition, the thread has to be converted to a fiber first, via ConvertThreadToFiber. This seems to be a requirement: https://stackoverflow.com/questions/19613016/fls-vs-tls-can-i-use-fiber-local-storage-in-place-of-tls . I don't think it's a good idea to do that within the library.
I'm inclined to not worry about the limit on thread local storage for Windows, since the Visual Studio compiler has supported C11 threads for several years and the limit should go away whenever we switch to requiring C11 compliant compilers.
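For context, and assuming the limitation in question is the missing destructor callback in the Win32 Tls* API, here's a minimal sketch of the C11 thread-specific storage interface that would eventually replace it (names hypothetical): tss_create() accepts a destructor that the runtime calls for each exiting thread.

```c
#include <threads.h>
#include <stdlib.h>

static tss_t c11_tls_key;

/* Called automatically for each thread's value when that thread exits */
static void
c11_tls_destructor(void *value)
{
    free(value);
}

static int
c11_tls_init(void)
{
    return (thrd_success == tss_create(&c11_tls_key, c11_tls_destructor)) ? 0 : -1;
}

static int
c11_tls_set(void *value)
{
    return (thrd_success == tss_set(c11_tls_key, value)) ? 0 : -1;
}
```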
The "top" answer says you actually don't need to convert the thread to a fiber, regardless of what the docs say. Also, it's worth noting that the reply that says you do need to convert it to a fiber is from Raymond Chen, who I assume is the same person who wrote the blog post I mentioned. You'd think he would've mentioned that requirement in the blog post. The SO reply is from 2013 and the blog post is 2019, maybe he realized it wasn't actually necessary in the interim?
At any rate, I didn't catch that the Windows implementation will work if C11 atomics/threading are supported, so it's probably fine as is. We'll just want to add a configure-time failure if thread concurrency is requested and neither C11 nor pthreads are supported (when thread concurrency is added, assuming it requires the recursive R/W lock).
Yes, I saw the discussion, but it didn't seem like a good idea to depend on undocumented / unsupported behavior.
I currently believe that the recursive R/W lock is not going to solve the problems we will have, so I'm not worried about it being unavailable on Windows. If the rec R/W locks turn out to be necessary, I'll either add a configure failure or work on the thread local storage for Windows in some way.
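In case it helps, a rough illustration of the guard being discussed; the real check would live in the configure / CMake logic, and the H5_HAVE_* macro names below are hypothetical, not the library's actual symbols.

```c
/* Fail the build if thread concurrency was requested but neither C11 threads
 * nor pthreads are available (the recursive R/W lock cannot be built with
 * Win32 threads alone). */
#if defined(H5_HAVE_THREAD_CONCURRENCY) && \
    !defined(H5_HAVE_C11_THREADS) && !defined(H5_HAVE_PTHREAD_H)
#error "Thread concurrency requires C11 threads or pthreads"
#endif
```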
Still debugging, but I can't seem to enable subfiling with these changes. CMake always fails when trying to find MPI_Comm_split_type from MPI. Reverting the patch I can enable the VFD again.
OK, thanks for checking - subfiling is working for me with the autotools builds.
Also, just a thought, but it might be nice if the threading tests all printed out which threading model is being used, for debugging purposes.
That's already in the configure output, but no harm in adding it here too.
True, I wouldn't necessarily worry about it for this PR, but it would be convenient to know at a glance.
I added some info to the output for the ttsafe test, although it's now a little non-standard compared to the rest of the tests.
@jhendersonHDF - I just noticed a place in the subfiling CMake configuration that I missed earlier. Can you please try again now?
Still running into the same issue, but I believe I have a fix that just involves rearranging some lines in the CMake code. Will get a patch to you later in the afternoon.
Cool, thanks!
Frontier subfiling performance results: the 64 node case was smaller, while the 128 and 256 node cases were larger but the same size. Each test was run a minimum of 5 times. I also encountered a seg fault in one of the runs at 256 nodes using the PR version.
Needs to address performance regression.
Thanks for running these tests, I'll start looking into the performance tomorrow.
@brtnfld - Before I get too far down the path to profiling, etc., could you please do a performance test of subfiling on this branch: https://github.com/qkoziol/hdf5/tree/non_rec_rwlock ? I've updated the thread pool to improve latency and I'm moderately hopeful that performance also improved.
No problem, I can rerun it once Frontier returns on Thursday.
Super, thanks!
As an extra point of information, just running the subfiling regression test locally on a Samsung SSD gives me these results:
time mpirun -np 24 --oversubscribe t_subfiling_vfd
Before changes:
real 1m43.879s
user 14m13.310s
sys 25m37.165s
After changes:
real 9m46.380s
user 77m45.929s
sys 146m25.289s
I'll do some more investigation with subfiling debugging on to get some timings and see if I can't figure out where the slowdown is coming from.
Interesting, I wonder what the issue is? Thanks for looking into it!
Reran with non_rec_rwlock for 256, not much of a difference. PR2.pdf
OK, thanks, I'll dig in
Worth noting that the new thread pool will likely perform significantly more mallocs/frees than the mercury implementation, since the new one allocates an H5TS_pool_task_t task structure each time a task is added to the pool. With that extra layer of indirection, there may also be some cache locality issues. Not sure that could be responsible for such a gap in performance, but it's the first thing that sticks out to me.
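To make the concern concrete, here's a generic sketch of the allocate-per-enqueue pattern being described (this is not the actual H5TS pool code; the names and structure layout are illustrative only).

```c
#include <stdlib.h>
#include <threads.h>

typedef struct pool_task {
    void (*func)(void *);       /* work callback */
    void             *arg;      /* callback context */
    struct pool_task *next;     /* next task in the FIFO queue */
} pool_task;

typedef struct pool {
    mtx_t      mutex;           /* protects the queue */
    cnd_t      notify;          /* signaled when work is queued */
    pool_task *head, *tail;     /* pending tasks */
} pool;

static int
pool_add_task(pool *p, void (*func)(void *), void *arg)
{
    /* One heap allocation per enqueued task -- the overhead in question */
    pool_task *task = malloc(sizeof(*task));

    if (NULL == task)
        return -1;
    task->func = func;
    task->arg  = arg;
    task->next = NULL;

    mtx_lock(&p->mutex);
    if (p->tail)
        p->tail->next = task;
    else
        p->head = task;
    p->tail = task;
    cnd_signal(&p->notify);
    mtx_unlock(&p->mutex);

    return 0;
}
```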
Agree, that's one of the aspects I am going to investigate.
@brtnfld - I just committed some optimizations to the non_rec_rwlock branch's version of the thread pool. Could you please re-test that branch?
Turns out that the malloc/free overhead is very low (<0.3% of the runtime), so probably not the issue.