
Deprecated tbb::task_scheduler_init

Open turingcompl33t opened this issue 4 years ago • 4 comments

The tbb::task_scheduler_init class is deprecated. This is not a pressing issue, and it would go unnoticed if it didn't produce warnings during compilation via a #pragma message() in the TBB headers.

The Intel documentation does not appear to suggest a direct alternative to tbb::task_scheduler_init. However, we currently only use the class for its static tbb::task_scheduler_init::default_num_threads() member function; we are not using it to limit the concurrency of parallel operations. According to the documentation, default_num_threads() just returns the number of logical processors available to the current process (which may differ from the total number of logical processors in the system, but that is likely a rare scenario, at least where NoisePage runs right now), so we could probably replace it with std::thread::hardware_concurrency() without altering behavior.
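
For reference, a minimal sketch of that swap, assuming a small hypothetical helper (not something that exists in the codebase today):

```cpp
#include <thread>

// Hypothetical replacement for tbb::task_scheduler_init::default_num_threads().
// hardware_concurrency() may return 0 when the value is not computable, so
// fall back to a single thread in that case.
inline unsigned int DefaultNumThreads() {
  const unsigned int n = std::thread::hardware_concurrency();
  return n != 0 ? n : 1;
}
```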

That said, it probably wouldn't be difficult to roll our own implementation that takes the affinity mask of the current process into account.

turingcompl33t avatar Jan 27 '21 01:01 turingcompl33t

task_scheduler_init was replaced by task_arena and global_control; see here: tbbrevamp.pdf.

There are a few places where we call the init, which defines how many threads TBB functions can use (on a really old branch this caused issues where a benchmark would always use all the threads available to the system). We've already started replacing the task_scheduler_init calls with task_arenas (table_vector_iterator.cpp), but the other calls have not been replaced. On my experimental branches, I've switched to using num_parallel_execution_threads instead of default_num_threads(), but I hardcode these at the moment and override the setting's default value (1 if the settings manager is enabled, -1 otherwise).
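
For concreteness, a minimal sketch of the global_control route from the revamp document, assuming the thread count comes from wherever we end up reading it (the setting, or a hardcoded value as on my branches):

```cpp
#include <tbb/global_control.h>

void RunParallelPhase(int num_threads) {
  // Process-wide cap on TBB concurrency for the lifetime of this object,
  // covering the "limit the number of threads" role of task_scheduler_init.
  tbb::global_control limit(tbb::global_control::max_allowed_parallelism,
                            num_threads);
  // ... invoke TBB parallel constructs here ...
}
```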

@eppingere and I used sched_getaffinity to get the affinity mask of the current process in our 721 project last year, but unfortunately couldn't find an equivalent API on OSX.
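
For posterity, roughly what that looks like on Linux (illustrative sketch, not the actual 721 code):

```cpp
#include <sched.h>  // Linux-only; requires _GNU_SOURCE (defined by default with g++/glibc).

// Count the logical processors in the calling process's affinity mask.
inline int AffinityThreadCount() {
  cpu_set_t mask;
  CPU_ZERO(&mask);
  if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
    return -1;  // Caller falls back to another source, e.g. hardware_concurrency().
  }
  return CPU_COUNT(&mask);
}
```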

thepinetree avatar Jan 27 '21 01:01 thepinetree

I think OSX has one, I just think we decided it wasn't worth it or something?

https://developer.apple.com/library/archive/releasenotes/Performance/RN-AffinityAPI/#//apple_ref/doc/uid/TP40006635-CH1-DontLinkElementID_2

eppingere avatar Jan 27 '21 01:01 eppingere

Ah yep, we couldn't bind anything to a processor so just didn't proceed with it:

OS X does not export interfaces that identify processors or control thread placement—explicit thread to processor binding is not supported. Instead, the kernel manages all thread placement. Applications expect that the scheduler will, under most circumstances, run its threads using a good processor placement with respect to cache affinity.

thepinetree avatar Jan 27 '21 01:01 thepinetree

I suspect we might need a TBB refactor, and that our database does not quite adhere to parallelization control in a few places. A number of TBB parallel calls do not make it clear that they use the correct number of threads per our parallel execution settings (or any other specific value; otherwise, as @turingcompl33t noted, they default to the number of processors available), nor do they always seem to be recorded in metrics (SetNumConcurrentEstimate). (See https://github.com/cmu-db/noisepage/blob/fbd927dd8e35bab27a2c7e98568ded71af4a1ea9/src/execution/sql/aggregation_hash_table.cpp#L678 and https://github.com/cmu-db/noisepage/blob/fbd927dd8e35bab27a2c7e98568ded71af4a1ea9/src/execution/sql/aggregation_hash_table.cpp#L719 for an example of each.)

Indeed, the TBB documentation notes:

If a parallel construct is used without task_scheduler_init object previously created, the scheduler will be initialized automatically with default settings, and will persist until this thread exits. Default concurrency level is defined as described in task_scheduler_init::initialize().

A discussion is needed, but I would propose that we rid ourselves of all calls to task_scheduler_init and wrap each TBB parallel construct in a task_arena that inherits the setting value, with the mechanism to be determined (possibly OSX/Linux macros and an appropriate affinity mask call); a rough sketch follows below. This does leave some questions (module.cpp, off the top of my head; see tbb::task), which uses TBB task calls that I'm not sure utilize parallelism, and it might not be possible to consult settings at that point.
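
To make the proposal concrete, a rough sketch of wrapping a parallel construct in an arena; num_threads stands in for whatever value we end up deriving from the settings manager or the affinity mask:

```cpp
#include <cstddef>

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/task_arena.h>

void ParallelWork(std::size_t size, int num_threads) {
  // Arena bounded by the configured thread count; parallel constructs executed
  // inside it are limited to that concurrency.
  tbb::task_arena arena(num_threads);
  arena.execute([&] {
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, size),
                      [](const tbb::blocked_range<std::size_t> &range) {
                        for (std::size_t i = range.begin(); i != range.end(); ++i) {
                          // ... per-element work ...
                        }
                      });
  });
}
```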

thepinetree avatar Jan 27 '21 05:01 thepinetree