
Refactor threading and other concurrency support

Open · qkoziol opened this issue 1 year ago • 95 comments

Complete overhaul of the concurrency-related aspects of the library (threading, atomics, locking, etc.), adding private routines in the H5TS package to allow internal algorithms to use all of these capabilities.

Adds a configure option (--enable-threads / HDF5_ENABLE_THREADS), enabled by default, to control whether threading is enabled within the library.

Now supports C11, pthreads, and Windows threading for all H5TS capabilities, except the recursive readers/writers lock, which is not supported on Windows (because Windows threads don't provide a callback on thread-local variable deletion).

Removes the mercury threading code.

qkoziol avatar May 09 '24 16:05 qkoziol

My understanding from reading the Win32 docs and Stack Overflow is that if we switch to the Fls* (fiber local storage) functions instead of Tls*, we can use the destructor callback to free the local storage, and it will work fine with either fibers or threads as long as the calling application doesn't mix the two. Though I suppose if we create threads internally with H5TS_thread_create, this might not work if the application uses fibers (I'm not certain on this).

fortnern avatar May 20 '24 22:05 fortnern
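For reference, a minimal sketch of the Win32 difference being discussed: TlsAlloc() has no destructor hook, while FlsAlloc() takes a callback that the OS invokes on fiber deletion and on thread exit, so per-thread storage can be freed automatically. The helper names here are hypothetical, not H5TS code.

```c
/* Sketch only: Fls* vs Tls* on Windows.  TlsAlloc() provides no cleanup
 * callback; FlsAlloc() does, which is the property being discussed above.
 */
#include <windows.h>
#include <stdlib.h>

static DWORD fls_index = FLS_OUT_OF_INDEXES;

/* Called automatically on fiber deletion / thread exit for non-NULL slots */
static VOID WINAPI free_local_storage(PVOID data)
{
    free(data);
}

static BOOL setup_local_storage(void)
{
    /* The TLS equivalent would be: tls_index = TlsAlloc();  (no callback possible) */
    fls_index = FlsAlloc(free_local_storage);
    return (fls_index != FLS_OUT_OF_INDEXES);
}

static BOOL set_local_value(size_t size)
{
    void *value = malloc(size);    /* hypothetical per-thread state */
    if (value == NULL)
        return FALSE;
    return FlsSetValue(fls_index, value);
}
```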

Alternatively, I suppose we could keep a global list of TLS values to free and free them in the atexit callback. It wouldn't be hard to make the list lock-free, since only additions need to be thread-safe. Of course, this means memory growth if we end up creating and destroying lots of mutexes during runtime (do we anticipate doing that, or will each mutex live for the life of the program?)

fortnern avatar May 20 '24 22:05 fortnern
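A rough sketch of that alternative, assuming C11 atomics (all names are made up for illustration): additions push onto the list head with a compare-and-swap loop, and the whole list is walked once in an atexit() handler. As noted above, nodes are never removed during runtime, which is where the memory growth would come from.

```c
/* Sketch of a grow-only, lock-free list of per-thread allocations that is
 * drained at program exit.  Only additions need to be thread-safe.
 */
#include <stdatomic.h>
#include <stdlib.h>

typedef struct tls_node {
    void            *value;   /* the per-thread allocation to free at exit */
    struct tls_node *next;
} tls_node_t;

static _Atomic(tls_node_t *) tls_list_head = NULL;

/* Lock-free push: retry the CAS until our node is installed at the head */
static int tls_list_push(void *value)
{
    tls_node_t *node = malloc(sizeof(*node));
    if (node == NULL)
        return -1;
    node->value = value;
    node->next  = atomic_load(&tls_list_head);
    while (!atomic_compare_exchange_weak(&tls_list_head, &node->next, node))
        ;   /* on failure, 'next' was updated to the current head; retry */
    return 0;
}

/* Registered with atexit(); runs single-threaded at program shutdown */
static void tls_list_free_all(void)
{
    tls_node_t *node = atomic_load(&tls_list_head);
    while (node != NULL) {
        tls_node_t *next = node->next;
        free(node->value);
        free(node);
        node = next;
    }
}
```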

My understanding is that fibers are like threads but with different scheduling semantics.

fortnern avatar May 20 '24 22:05 fortnern

My understanding from reading the Win32 docs and Stack Overflow is that if we switch to the Fls* (fiber local storage) functions instead of Tls*, we can use the destructor callback to free the local storage, and it will work fine with either fibers or threads as long as the calling application doesn't mix the two. Though I suppose if we create threads internally with H5TS_thread_create, this won't work if the application uses fibers.

Cool - I wasn't aware of those. I'll take a look tonight and see if they will work.

qkoziol avatar May 20 '24 23:05 qkoziol

https://devblogs.microsoft.com/oldnewthing/20191011-00/?p=102989

fortnern avatar May 20 '24 23:05 fortnern

https://devblogs.microsoft.com/oldnewthing/20191011-00/?p=102989

I'm reading https://learn.microsoft.com/en-us/windows/win32/procthread/fibers and have some concerns about using fiber local storage.

What happens (in the probably rare case) when a Windows application that uses multiple fibers within a thread calls into the library from different fibers? (e.g., the example in the MS fiber docs that performs I/O operations from multiple fibers.) It seems like we would allocate a separate fiber-local value for each fiber in the thread.

Having fiber-local storage per fiber, but mutexes, condition variables, etc. per thread, looks like a recipe for mismatches and trouble. Hypothetically, one fiber could enter the library, acquire the API lock, and make a callback that switches to another fiber; that fiber would look like another thread to the library and would not obey the reentrancy conditions I outline in the RFC I've circulated. Probably extremely rare, but definitely possible.

In addition, the thread has to be converted to a fiber first, via ConvertThreadToFiber. This seems to be a requirement: https://stackoverflow.com/questions/19613016/fls-vs-tls-can-i-use-fiber-local-storage-in-place-of-tls . I don't think it's a good idea to do that within the library.

I'm inclined to not worry about the limit on thread local storage for Windows, since the Visual Studio compiler has supported C11 threads for several years and the limit should go away whenever we switch to requiring C11 compliant compilers.

qkoziol avatar May 21 '24 02:05 qkoziol
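For context on the C11 point: the C11 thread-specific storage API does provide the destructor hook that Win32 TlsAlloc() lacks, which is why the limitation should go away once <threads.h> can be required. A minimal sketch (identifiers are illustrative, not actual H5TS names):

```c
/* Sketch only: C11 thread-specific storage with a per-thread destructor. */
#include <threads.h>
#include <stdlib.h>

static tss_t lock_count_key;

static void lock_count_destroy(void *value)
{
    free(value);    /* invoked automatically when each thread exits */
}

static int lock_count_init(void)
{
    /* The second argument is the per-thread destructor callback */
    return (tss_create(&lock_count_key, lock_count_destroy) == thrd_success) ? 0 : -1;
}

static int lock_count_set(unsigned count)
{
    unsigned *value = malloc(sizeof(*value));   /* hypothetical per-thread state */
    if (value == NULL)
        return -1;
    *value = count;
    return (tss_set(lock_count_key, value) == thrd_success) ? 0 : -1;
}
```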

In addition, the thread has to be converted to a fiber first, via ConvertThreadToFiber. This seems to be a requirement: https://stackoverflow.com/questions/19613016/fls-vs-tls-can-i-use-fiber-local-storage-in-place-of-tls . I don't think it's a good idea to do that within the library.

The "top" answer says you actually don't need to convert the thread to a fiber, regardless of what the docs say. Also, it's worth noting that the reply that says you do need to convert it to a fiber is from Raymond Chen, who I assume is the same person who wrote the blog post I mentioned. You'd think he would've mentioned that requirement in the blog post. The SO reply is from 2013 and the blog post is 2019, maybe he realized it wasn't actually necessary in the interim?

At any rate, I didn't catch that the Windows implementation will work if C11 atomics/threading are supported, so it's probably fine as is. We'll just want to add a configure-time failure if thread concurrency is requested and neither C11 nor pthreads is supported (when thread concurrency is added, assuming it requires the recursive R/W lock).

fortnern avatar May 21 '24 14:05 fortnern
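A sketch of what that configure-time guard might look like in CMake. The option and detection variable names here are placeholders, not HDF5's actual ones:

```cmake
# Placeholder names throughout; the point is to fail configuration early
# when thread concurrency is requested but no usable threading model exists.
if (HDF5_ENABLE_THREAD_CONCURRENCY AND NOT HAVE_C11_THREADS AND NOT HAVE_PTHREADS)
  message (FATAL_ERROR
           "Thread concurrency requires C11 threads or pthreads for the "
           "recursive R/W lock; neither was found on this system.")
endif ()
```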

In addition, the thread has to be converted to a fiber first, via ConvertThreadToFiber. This seems to be a requirement: https://stackoverflow.com/questions/19613016/fls-vs-tls-can-i-use-fiber-local-storage-in-place-of-tls . I don't think it's a good idea to do that within the library.

The "top" answer says you actually don't need to convert the thread to a fiber, regardless of what the docs say. Also, it's worth noting that the reply that says you do need to convert it to a fiber is from Raymond Chen, who I assume is the same person who wrote the blog post I mentioned. You'd think he would've mentioned that requirement in the blog post. The SO reply is from 2013 and the blog post is 2019, maybe he realized it wasn't actually necessary in the interim?

Yes, I saw the discussion, but it didn't seem like a good idea to depend on undocumented / unsupported behavior.

At any rate, I didn't catch that the Windows implementation will work if C11 atomics/threading are supported, so it's probably fine as is. We'll just want to add a configure-time failure if thread concurrency is requested and neither C11 nor pthreads is supported (when thread concurrency is added, assuming it requires the recursive R/W lock).

I currently believe that the recursive R/W lock is not going to solve the problems we will have, so I'm not worried about it being unavailable on Windows. If the recursive R/W lock turns out to be necessary, I'll either add a configure failure or work out the thread-local storage for Windows in some way.

qkoziol avatar May 21 '24 14:05 qkoziol

Still debugging, but I can't seem to enable subfiling with these changes. CMake always fails when trying to find MPI_Comm_split_type from MPI. After reverting the patch, I can enable the VFD again.

jhendersonHDF avatar May 24 '24 17:05 jhendersonHDF

Still debugging, but I can't seem to enable subfiling with these changes. CMake always fails when trying to find MPI_Comm_split_type from MPI. After reverting the patch, I can enable the VFD again.

OK, thanks for checking - subfiling is working for me with the autotools builds.

qkoziol avatar May 24 '24 18:05 qkoziol

Also, just a thought but it might be nice if the threading tests all printed out what threading model is being used for debugging purposes.

jhendersonHDF avatar May 24 '24 18:05 jhendersonHDF

Also, just a thought but it might be nice if the threading tests all printed out what threading model is being used for debugging purposes.

That's already in the configure output, but no harm in adding it here too.

qkoziol avatar May 24 '24 18:05 qkoziol

Also, just a thought but it might be nice if the threading tests all printed out what threading model is being used for debugging purposes.

That's already in the configure output, but no harm in adding it here too.

True, I wouldn't necessarily worry about it for this PR but it would be convenient to know right at a glance.

jhendersonHDF avatar May 24 '24 19:05 jhendersonHDF

Also, just a thought but it might be nice if the threading tests all printed out what threading model is being used for debugging purposes.

That's already in the configure output, but no harm in adding it here too.

True, I wouldn't necessarily worry about it for this PR but it would be convenient to know right at a glance.

I added some info to the output for the ttsafe test, although it's now a little non-standard compared to the rest of the tests.

qkoziol avatar May 25 '24 00:05 qkoziol
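Presumably something along these lines; the feature macros below are placeholders for whatever the build actually defines, not the real HDF5 symbols:

```c
/* Illustrative only: report which threading model the test was built with. */
#include <stdio.h>

static void print_threading_model(void)
{
#if defined(USE_C11_THREADS)
    printf("Threading model: C11 threads\n");
#elif defined(USE_PTHREADS)
    printf("Threading model: pthreads\n");
#elif defined(USE_WIN32_THREADS)
    printf("Threading model: Windows threads\n");
#else
    printf("Threading model: none (threads disabled)\n");
#endif
}
```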

Still debugging, but I can't seem to enable subfiling with these changes. CMake always fails when trying to find MPI_Comm_split_type from MPI. After reverting the patch, I can enable the VFD again.

OK, thanks for checking - subfiling is working for me with the autotools builds.

@jhendersonHDF - I just noticed a place in the subfiling CMake configuration that I missed earlier. Can you please try again now?

qkoziol avatar May 28 '24 23:05 qkoziol

@jhendersonHDF - I just noticed a place in the subfiling CMake configuration that I missed earlier. Can you please try again now?

Still running into the same issue, but I believe I have a fix that just involves rearranging some lines in the CMake code. Will get a patch to you later in the afternoon.

jhendersonHDF avatar May 29 '24 16:05 jhendersonHDF
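For anyone hitting the same thing, the kind of rearrangement being described is probably along these lines (names are illustrative, not the actual HDF5 CMake code): the MPI include/link settings have to be populated before the symbol probe runs, or the probe compiles without MPI and reports MPI_Comm_split_type as missing.

```cmake
# Illustrative sketch of the ordering dependency, not the real HDF5 CMake code.
find_package (MPI REQUIRED)

include (CheckSymbolExists)
set (CMAKE_REQUIRED_INCLUDES  ${MPI_C_INCLUDE_DIRS})
set (CMAKE_REQUIRED_LIBRARIES ${MPI_C_LIBRARIES})

# If this check runs before the set() calls above, the test program is built
# without MPI and the symbol is reported as missing even when it exists.
check_symbol_exists (MPI_Comm_split_type "mpi.h" HAVE_MPI_COMM_SPLIT_TYPE)
```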

@jhendersonHDF - I just noticed a place in the subfiling CMake configuration that I missed earlier. Can you please try again now?

Still running into the same issue, but I believe I have a fix that just involves rearranging some lines in the CMake code. Will get a patch to you later in the afternoon.

Cool, thanks!

qkoziol avatar May 29 '24 16:05 qkoziol

Frontier subfiling performance results: the 64-node case was smaller, while the 128- and 256-node cases were larger but the same size as each other. Each test was run a minimum of 5 times. I also encountered a segfault in one of the runs at 256 nodes using the PR version.

PR.pdf

brtnfld avatar Jun 03 '24 22:06 brtnfld

Needs to address performance regression.

Thanks for running these tests, I'll start looking into the performance tomorrow.

qkoziol avatar Jun 04 '24 01:06 qkoziol

@brtnfld - Before I get too far down the path to profiling, etc., could you please do a performance test of subfiling on this branch: https://github.com/qkoziol/hdf5/tree/non_rec_rwlock ? I've updated the thread pool to improve latency and I'm moderately hopeful that performance also improved.

qkoziol avatar Jun 04 '24 14:06 qkoziol

@brtnfld - Before I get too far down the path to profiling, etc., could you please do a performance test of subfiling on this branch: https://github.com/qkoziol/hdf5/tree/non_rec_rwlock ? I've updated the thread pool to improve latency and I'm moderately hopeful that performance also improved.

No problem, I can rerun it once Frontier returns on Thursday.

brtnfld avatar Jun 04 '24 14:06 brtnfld

@brtnfld - Before I get too far down the path to profiling, etc., could you please do a performance test of subfiling on this branch: https://github.com/qkoziol/hdf5/tree/non_rec_rwlock ? I've updated the thread pool to improve latency and I'm moderately hopeful that performance also improved.

No problem, I can rerun it once Frontier returns on Thursday.

Super, thanks!

qkoziol avatar Jun 04 '24 15:06 qkoziol

As an extra point of information, just running the subfiling regression test locally on a Samsung SSD gives me these results:

time mpirun -np 24 --oversubscribe t_subfiling_vfd

Before changes:

real    1m43.879s
user    14m13.310s
sys    25m37.165s

After changes:

real    9m46.380s
user    77m45.929s
sys    146m25.289s

I'll do some more investigation with subfiling debugging on to get some timings and see if I can't figure out where the slowdown is coming from.

jhendersonHDF avatar Jun 04 '24 15:06 jhendersonHDF

As an extra point of information, just running the subfiling regression test locally on a Samsung SSD gives me these results:

time mpirun -np 24 --oversubscribe t_subfiling_vfd

Before changes:

real    1m43.879s
user    14m13.310s
sys    25m37.165s

After changes:

real    9m46.380s
user    77m45.929s
sys    146m25.289s

I'll do some more investigation with subfiling debugging on to get some timings and see if I can't figure out where the slowdown is coming from.

Interesting, I wonder what the issue is? Thanks for looking into it!

qkoziol avatar Jun 04 '24 16:06 qkoziol

Reran with non_rec_rwlock for 256, not much of a difference.

PR2.pdf

brtnfld avatar Jun 05 '24 14:06 brtnfld

Reran with non_rec_rwlock for 256, not much of a difference.

PR2.pdf

OK, thanks, I'll dig in

qkoziol avatar Jun 05 '24 14:06 qkoziol

Reran with non_rec_rwlock for 256, not much of a difference. PR2.pdf

OK, thanks, I'll dig in

Worth noting that the new thread pool will likely be performing significantly more mallocs/frees than the mercury implementation, since the new one allocates an H5TS_pool_task_t task structure each time a task is added to the pool. With that extra layer of indirection, there may also be some cache locality issues. Not sure that could be responsible for such a gap in performance, but it's the first thing that sticks out to me.

jhendersonHDF avatar Jun 05 '24 15:06 jhendersonHDF
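To make the concern concrete, this is roughly the pattern being described (a sketch, not the actual H5TS code): each submitted task costs a malloc on the submit side and a free on the worker side, plus a pointer chase to reach the callback.

```c
/* Sketch of a per-task heap allocation in a thread pool; names are invented. */
#include <stdlib.h>

typedef struct pool_task {
    void (*func)(void *ctx);     /* work callback */
    void  *ctx;                  /* callback argument */
    struct pool_task *next;      /* queue link */
} pool_task_t;

/* Submit path: one heap allocation per task added (queue locking omitted) */
static pool_task_t *pool_task_create(void (*func)(void *), void *ctx)
{
    pool_task_t *task = malloc(sizeof(*task));
    if (task != NULL) {
        task->func = func;
        task->ctx  = ctx;
        task->next = NULL;
    }
    return task;
}

/* Worker side: run the callback, then release the node */
static void pool_task_run(pool_task_t *task)
{
    task->func(task->ctx);
    free(task);                  /* the matching free, once per task */
}
```

A free list of task nodes or caller-owned work descriptors are the usual ways to avoid that per-task cost; the profiling mentioned below suggests the allocations themselves aren't the bottleneck here.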

Reran with non_rec_rwlock for 256, not much of a difference. PR2.pdf

OK, thanks, I'll dig in

Worth noting that the new thread pool will likely be performing significantly more mallocs/frees than the mercury implementation, since the new one allocates an H5TS_pool_task_t task structure each time a task is added to the pool. With that extra layer of indirection, there may also be some cache locality issues. Not sure that could be responsible for such a gap in performance, but it's the first thing that sticks out to me.

Agree, that's one of the aspects I am going to investigate.

qkoziol avatar Jun 05 '24 16:06 qkoziol

@brtnfld - I just committed some optimizations to the non_rec_rwlock branch's version of the thread pool. Could you please re-test that branch?

qkoziol avatar Jun 06 '24 19:06 qkoziol

Reran with non_rec_rwlock for 256, not much of a difference. PR2.pdf

OK, thanks, I'll dig in

Worth noting that the new thread pool will likely be performing significantly more mallocs/frees than the mercury implementation, since the new one allocates an H5TS_pool_task_t task structure each time a task is added to the pool. With that extra layer of indirection, there may also be some cache locality issues. Not sure that could be responsible for such a gap in performance, but it's the first thing that sticks out to me.

Agree, that's one of the aspects I am going to investigate.

Turns out that the malloc/free overhead is very low (<0.3% of the runtime), so probably not the issue.

qkoziol avatar Jun 06 '24 19:06 qkoziol