[REQUEST] Support building single-threaded OpenVDB
These days, systems with 100+ hardware threads are increasingly common, but TBB performance on such systems is atrocious; it is particularly bad on non-x86 systems with more advanced memory models that have "light weight" synchronization primitives (that TBB does not use.)
As an example, I just moved a code to a 350-hardware-thread system and performance dropped by 300x compared to my 8-core desktop, because TBB's concurrent hash map is... not the best.
Admittedly, I am creating and destroying lots of ValueAccessors, but it is sad that the reason this is bad is because 99.9% of the time (yes, this is a correctly-rounded measurement 🤣 ) is spent in TBB's concurrent hash map.
Another issue I run into (this one costing another 2x factor of runtime on large systems) is due to the use of std::atomic, mutexes, and so on, which are not free, even when uncontended.
Describe the solution you'd like
It would be ideal if OpenVDB could have a "single threaded" configuration option, so that all uses of atomic ops, mutexes, TBB, etc., are removed - this leaves a clean, faster (potentially much faster) code path for users to manage whatever concurrency they might have on their own.
Failing that, it would be great if the use of TBB could be made optional at configure time - this avoids the dependency and the catastrophically poor performance of TBB on large, non-x86 systems.
Describe alternatives you've considered
Completely removing TBB might be simpler than maintaining it - more loosely-coupled, task-oriented concurrency is a much saner approach for current and future systems, instead of the current fine-grained, broadcast-heavy approach taken in e.g. {Const}AccessorRegistry
This is definitely something we have considered in the past. I don't think we would be looking to completely strip out TBB, that seems impractical, but I agree with you that it should at least be possible to build with a different threading implementation or to disable it altogether and run single-threaded. We have made a start (of sorts) by consolidating some TBB-related functionality into a new threading header (https://github.com/AcademySoftwareFoundation/openvdb/blob/master/openvdb/openvdb/thread/Threading.h). The first real attempt would likely be to wrap TBB in our own OpenVDB primitives, but even that is a substantial amount of work.
In the short term, something you can try if you like is to use the "unsafe" ValueAccessor that does not register itself with the tree, so doesn't go through the mutex at all:
https://github.com/AcademySoftwareFoundation/openvdb/blob/master/openvdb/openvdb/tree/Tree.h#L681
We also don't make a lot of usage of atomics and other synchronization primitives. Probably the most notable usage is delayed-loading, which is a feature that can be disabled:
https://github.com/AcademySoftwareFoundation/openvdb/blob/master/CMakeLists.txt#L83
Thank you for these really interesting comments!
Paving the way towards optional, configurable threading seems awesome, I was unaware steps were already being taken in that direction.
I was also unaware of getUnsafeAccessor() - which is brilliant! That resolved the immediate performance bugs I was running into. Since standard C++ containers have the same kind of safety guarantee: multiple concurrent readers are fine, so long as nobody is writing - far from being unsafe, it feels like these are normal accessors and the registry is an optional extra. I suspect that larger applications using OpenVDB as a data structure are going to have their own threading considerations, and so if they're already doing some work to manage concurrency, there's not a lot of extra value in OpenVDB offering its own. (But I understand that other applications making use of OpenVDB's tools on larger datasets could save a lot of time with those tools running in parallel...)
We also don't make a lot of usage of atomics and other synchronization primitives. Probably the most notable usage is delayed-loading, which is a feature that can be disabled:
https://github.com/AcademySoftwareFoundation/openvdb/blob/master/CMakeLists.txt#L83
I see. I had in mind the accessor registry - which, via the TBB concurrent hash map - ends up choking on synchronization.
As a side note, I have found that folly's ConcurrentHashMap (https://github.com/facebook/folly/blob/main/folly/concurrency/ConcurrentHashMap.h) doesn't have TBB's performance bugs, and happily scales to hundreds of threads.