Fibonacci: segfaults and scalability issues
I've been measuring the overhead of various multithreading runtimes. HPX seems to have a couple of issues with a naive Fibonacci implementation.
Expected Behavior
Fibonacci examples complete in a reasonable time
Actual Behavior
HPX either segfaults or takes a disproportionately long time
Steps to Reproduce the Problem
Problem 1: Sudden spike in time required to compute fibonacci.
- Compile all the quickstart fibonacci examples
- Run `bin/fibonacci_one --n-value 22`
- Run `bin/fibonacci_one --n-value 23`
- The time required for `fibonacci_direct` changes from 0.218 seconds to 8.263 s

Problem 2: Segfaults
- Run `fibonacci_await` with any n; it segfaults

Problem 3: A rare segfault makes all subsequent runs take a very long time.
I managed to segfault the fibonacci binary with a value of 30 and couldn't reclaim control of my console even with Ctrl+C; after that, the performance of everything was severely degraded.
Before a segfault on fibonacci 30

After a segfault on fibonacci 30
- Run `bin/fibonacci --n-value 23`; it takes 12 s

Specifications
- HPX Version: master - git: fd67ac0f86
- Platform (compiler, OS): GCC, Linux
Core library:
HPX_WITH_ACTION_BASE_COMPATIBILITY=ON
HPX_WITH_AGAS_DUMP_REFCNT_ENTRIES=OFF
HPX_WITH_APEX=OFF
HPX_WITH_ATTACH_DEBUGGER_ON_TEST_FAILURE=OFF
HPX_WITH_AUTOMATIC_SERIALIZATION_REGISTRATION=ON
HPX_WITH_COROUTINE_COUNTERS=OFF
HPX_WITH_CXX14_RETURN_TYPE_DEDUCTION=TRUE
HPX_WITH_DEPRECATION_WARNINGS=ON
HPX_WITH_GOOGLE_PERFTOOLS=OFF
HPX_WITH_HPXMP=OFF
HPX_WITH_IO_COUNTERS=ON
HPX_WITH_IO_POOL=ON
HPX_WITH_ITTNOTIFY=OFF
HPX_WITH_LOGGING=ON
HPX_WITH_MORE_THAN_64_THREADS=OFF
HPX_WITH_NATIVE_TLS=ON
HPX_WITH_NETWORKING=ON
HPX_WITH_PAPI=OFF
HPX_WITH_PARCELPORT_ACTION_COUNTERS=OFF
HPX_WITH_PARCELPORT_LIBFABRIC=OFF
HPX_WITH_PARCELPORT_MPI=OFF
HPX_WITH_PARCELPORT_MPI_MULTITHREADED=OFF
HPX_WITH_PARCELPORT_TCP=ON
HPX_WITH_PARCELPORT_VERBS=OFF
HPX_WITH_PARCEL_COALESCING=ON
HPX_WITH_PARCEL_PROFILING=OFF
HPX_WITH_REGISTER_THREAD_COMPATIBILITY=ON
HPX_WITH_SANITIZERS=OFF
HPX_WITH_SCHEDULER_LOCAL_STORAGE=OFF
HPX_WITH_SPINLOCK_DEADLOCK_DETECTION=OFF
HPX_WITH_STACKTRACES=ON
HPX_WITH_SWAP_CONTEXT_EMULATION=OFF
HPX_WITH_TESTS_DEBUG_LOG=OFF
HPX_WITH_THREAD_BACKTRACE_ON_SUSPENSION=OFF
HPX_WITH_THREAD_CREATION_AND_CLEANUP_RATES=OFF
HPX_WITH_THREAD_CUMULATIVE_COUNTS=ON
HPX_WITH_THREAD_DEBUG_INFO=OFF
HPX_WITH_THREAD_DESCRIPTION_FULL=OFF
HPX_WITH_THREAD_GUARD_PAGE=ON
HPX_WITH_THREAD_IDLE_RATES=OFF
HPX_WITH_THREAD_LOCAL_STORAGE=OFF
HPX_WITH_THREAD_MANAGER_IDLE_BACKOFF=ON
HPX_WITH_THREAD_QUEUE_WAITTIME=OFF
HPX_WITH_THREAD_STACK_MMAP=ON
HPX_WITH_THREAD_STEALING_COUNTS=OFF
HPX_WITH_THREAD_TARGET_ADDRESS=OFF
HPX_WITH_TIMER_POOL=ON
HPX_WITH_TUPLE_RVALUE_SWAP=ON
HPX_WITH_VALGRIND=OFF
HPX_WITH_VERIFY_LOCKS=OFF
HPX_WITH_VERIFY_LOCKS_BACKTRACE=OFF
HPX_WITH_VERIFY_LOCKS_GLOBALLY=OFF
HPX_PARCEL_MAX_CONNECTIONS=512
HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
HPX_AGAS_LOCAL_CACHE_SIZE=4096
HPX_HAVE_MALLOC=tcmalloc
HPX_PREFIX (configured)=/usr/local
HPX_PREFIX=/home/beta/Programming/Parallelism/hpx/build
Module algorithms:
HPX_ALGORITHMS_WITH_COMPATIBILITY_HEADERS=OFF
HPX_ALGORITHMS_WITH_DEPRECATION_WARNINGS=OFF
HPX_ALGORITHMS_WITH_TESTS=OFF
Module allocator_support:
HPX_ALLOCATOR_SUPPORT_WITH_COMPATIBILITY_HEADERS=ON
HPX_ALLOCATOR_SUPPORT_WITH_DEPRECATION_WARNINGS=ON
HPX_ALLOCATOR_SUPPORT_WITH_TESTS=OFF
Module assertion:
HPX_ASSERTION_WITH_COMPATIBILITY_HEADERS=ON
HPX_ASSERTION_WITH_DEPRECATION_WARNINGS=ON
HPX_ASSERTION_WITH_TESTS=OFF
Module basic_execution:
HPX_BASIC_EXECUTION_WITH_COMPATIBILITY_HEADERS=OFF
HPX_BASIC_EXECUTION_WITH_DEPRECATION_WARNINGS=OFF
HPX_BASIC_EXECUTION_WITH_TESTS=OFF
Module cache:
HPX_CACHE_WITH_COMPATIBILITY_HEADERS=ON
HPX_CACHE_WITH_DEPRECATION_WARNINGS=ON
HPX_CACHE_WITH_TESTS=OFF
Module checkpoint:
HPX_CHECKPOINT_WITH_COMPATIBILITY_HEADERS=ON
HPX_CHECKPOINT_WITH_DEPRECATION_WARNINGS=ON
HPX_CHECKPOINT_WITH_TESTS=OFF
Module collectives:
HPX_COLLECTIVES_WITH_COMPATIBILITY_HEADERS=ON
HPX_COLLECTIVES_WITH_DEPRECATION_WARNINGS=ON
HPX_COLLECTIVES_WITH_TESTS=OFF
Module compute:
HPX_COMPUTE_WITH_COMPATIBILITY_HEADERS=OFF
HPX_COMPUTE_WITH_DEPRECATION_WARNINGS=ON
HPX_COMPUTE_WITH_TESTS=OFF
Module compute_cuda:
HPX_COMPUTE_CUDA_WITH_COMPATIBILITY_HEADERS=OFF
HPX_COMPUTE_CUDA_WITH_DEPRECATION_WARNINGS=ON
HPX_COMPUTE_CUDA_WITH_TESTS=OFF
Module concepts:
HPX_CONCEPTS_WITH_COMPATIBILITY_HEADERS=ON
HPX_CONCEPTS_WITH_DEPRECATION_WARNINGS=ON
HPX_CONCEPTS_WITH_TESTS=OFF
Module concurrency:
HPX_CONCURRENCY_WITH_COMPATIBILITY_HEADERS=ON
HPX_CONCURRENCY_WITH_DEPRECATION_WARNINGS=ON
HPX_CONCURRENCY_WITH_TESTS=OFF
Module config:
HPX_CONFIG_WITH_COMPATIBILITY_HEADERS=OFF
HPX_CONFIG_WITH_DEPRECATION_WARNINGS=OFF
HPX_CONFIG_WITH_TESTS=OFF
Module coroutines:
HPX_COROUTINES_WITH_COMPATIBILITY_HEADERS=ON
HPX_COROUTINES_WITH_DEPRECATION_WARNINGS=ON
HPX_COROUTINES_WITH_TESTS=OFF
Module datastructures:
HPX_DATASTRUCTURES_WITH_ADAPT_STD_TUPLE=ON
HPX_DATASTRUCTURES_WITH_COMPATIBILITY_HEADERS=ON
HPX_DATASTRUCTURES_WITH_DEPRECATION_WARNINGS=ON
HPX_DATASTRUCTURES_WITH_TESTS=OFF
Module debugging:
HPX_DEBUGGING_WITH_COMPATIBILITY_HEADERS=ON
HPX_DEBUGGING_WITH_DEPRECATION_WARNINGS=ON
HPX_DEBUGGING_WITH_TESTS=OFF
Module errors:
HPX_ERRORS_WITH_COMPATIBILITY_HEADERS=ON
HPX_ERRORS_WITH_DEPRECATION_WARNINGS=ON
HPX_ERRORS_WITH_TESTS=OFF
Module execution:
HPX_EXECUTION_WITH_COMPATIBILITY_HEADERS=OFF
HPX_EXECUTION_WITH_DEPRECATION_WARNINGS=OFF
HPX_EXECUTION_WITH_TESTS=OFF
Module filesystem:
HPX_FILESYSTEM_WITH_BOOST_FILESYSTEM_COMPATIBILITY=OFF
HPX_FILESYSTEM_WITH_COMPATIBILITY_HEADERS=OFF
HPX_FILESYSTEM_WITH_DEPRECATION_WARNINGS=OFF
HPX_FILESYSTEM_WITH_TESTS=OFF
Module format:
HPX_FORMAT_WITH_COMPATIBILITY_HEADERS=ON
HPX_FORMAT_WITH_DEPRECATION_WARNINGS=ON
HPX_FORMAT_WITH_TESTS=OFF
Module functional:
HPX_FUNCTIONAL_WITH_COMPATIBILITY_HEADERS=ON
HPX_FUNCTIONAL_WITH_DEPRECATION_WARNINGS=ON
HPX_FUNCTIONAL_WITH_TESTS=OFF
Module hardware:
HPX_HARDWARE_WITH_COMPATIBILITY_HEADERS=ON
HPX_HARDWARE_WITH_DEPRECATION_WARNINGS=ON
HPX_HARDWARE_WITH_TESTS=OFF
Module hashing:
HPX_HASHING_WITH_COMPATIBILITY_HEADERS=ON
HPX_HASHING_WITH_DEPRECATION_WARNINGS=ON
HPX_HASHING_WITH_TESTS=OFF
Module iterator_support:
HPX_ITERATOR_SUPPORT_WITH_COMPATIBILITY_HEADERS=ON
HPX_ITERATOR_SUPPORT_WITH_DEPRECATION_WARNINGS=ON
HPX_ITERATOR_SUPPORT_WITH_TESTS=OFF
Module logging:
HPX_LOGGING_WITH_COMPATIBILITY_HEADERS=ON
HPX_LOGGING_WITH_DEPRECATION_WARNINGS=ON
HPX_LOGGING_WITH_TESTS=OFF
Module memory:
HPX_MEMORY_WITH_COMPATIBILITY_HEADERS=OFF
HPX_MEMORY_WITH_DEPRECATION_WARNINGS=ON
HPX_MEMORY_WITH_TESTS=OFF
Module plugin:
HPX_PLUGIN_WITH_COMPATIBILITY_HEADERS=ON
HPX_PLUGIN_WITH_DEPRECATION_WARNINGS=ON
HPX_PLUGIN_WITH_TESTS=OFF
Module preprocessor:
HPX_PREPROCESSOR_WITH_COMPATIBILITY_HEADERS=ON
HPX_PREPROCESSOR_WITH_DEPRECATION_WARNINGS=ON
HPX_PREPROCESSOR_WITH_TESTS=OFF
Module program_options:
HPX_PROGRAM_OPTIONS_WITH_BOOST_PROGRAM_OPTIONS_COMPATIBILITY=ON
HPX_PROGRAM_OPTIONS_WITH_COMPATIBILITY_HEADERS=OFF
HPX_PROGRAM_OPTIONS_WITH_DEPRECATION_WARNINGS=OFF
HPX_PROGRAM_OPTIONS_WITH_TESTS=OFF
Module resiliency:
HPX_RESILIENCY_WITH_COMPATIBILITY_HEADERS=OFF
HPX_RESILIENCY_WITH_DEPRECATION_WARNINGS=OFF
HPX_RESILIENCY_WITH_TESTS=OFF
Module resource_partitioner:
HPX_RESOURCE_PARTITIONER_WITH_COMPATIBILITY_HEADERS=ON
HPX_RESOURCE_PARTITIONER_WITH_DEPRECATION_WARNINGS=ON
HPX_RESOURCE_PARTITIONER_WITH_TESTS=OFF
Module segmented_algorithms:
HPX_SEGMENTED_ALGORITHMS_WITH_COMPATIBILITY_HEADERS=OFF
HPX_SEGMENTED_ALGORITHMS_WITH_DEPRECATION_WARNINGS=OFF
HPX_SEGMENTED_ALGORITHMS_WITH_TESTS=OFF
Module serialization:
HPX_SERIALIZATION_WITH_BOOST_TYPES=ON
HPX_SERIALIZATION_WITH_COMPATIBILITY_HEADERS=ON
HPX_SERIALIZATION_WITH_DEPRECATION_WARNINGS=ON
HPX_SERIALIZATION_WITH_TESTS=OFF
Module statistics:
HPX_STATISTICS_WITH_COMPATIBILITY_HEADERS=ON
HPX_STATISTICS_WITH_DEPRECATION_WARNINGS=ON
HPX_STATISTICS_WITH_TESTS=OFF
Module testing:
HPX_TESTING_WITH_COMPATIBILITY_HEADERS=ON
HPX_TESTING_WITH_DEPRECATION_WARNINGS=ON
HPX_TESTING_WITH_TESTS=OFF
Module thread_support:
HPX_THREAD_SUPPORT_WITH_COMPATIBILITY_HEADERS=ON
HPX_THREAD_SUPPORT_WITH_DEPRECATION_WARNINGS=ON
HPX_THREAD_SUPPORT_WITH_TESTS=OFF
Module threadmanager:
HPX_THREADMANAGER_WITH_COMPATIBILITY_HEADERS=ON
HPX_THREADMANAGER_WITH_DEPRECATION_WARNINGS=ON
HPX_THREADMANAGER_WITH_TESTS=OFF
Module timing:
HPX_TIMING_WITH_COMPATIBILITY_HEADERS=ON
HPX_TIMING_WITH_DEPRECATION_WARNINGS=ON
HPX_TIMING_WITH_TESTS=OFF
Module topology:
HPX_TOPOLOGY_WITH_COMPATIBILITY_HEADERS=ON
HPX_TOPOLOGY_WITH_DEPRECATION_WARNINGS=ON
HPX_TOPOLOGY_WITH_TESTS=OFF
Module type_support:
HPX_TYPE_SUPPORT_WITH_COMPATIBILITY_HEADERS=ON
HPX_TYPE_SUPPORT_WITH_DEPRECATION_WARNINGS=ON
HPX_TYPE_SUPPORT_WITH_TESTS=OFF
Module util:
HPX_UTIL_WITH_COMPATIBILITY_HEADERS=ON
HPX_UTIL_WITH_DEPRECATION_WARNINGS=ON
HPX_UTIL_WITH_TESTS=OFF
{version}: V1.4.0-trunk (AGAS: V3.0), Git: fd67ac0f86
{boost}: V1.71.0
{build-type}: release
{date}: Nov 18 2019 14:04:57
{platform}: linux
{compiler}: GNU C++ version 9.2.0
{stdlib}: GNU libstdc++ version 20190812
Thanks for reporting! However, I don't think that the trivial Fibonacci implementation is a good benchmark for any runtime system... what is your real use case?
It's a good benchmark to measure the overhead of runtimes.
I'm currently building my own runtime for the Nim programming language with a focus on very fine-grained parallelism. My goal is to expose parallelism to users even for tasks on the order of 1000 cycles, because as we get more cores, the minimum grain size becomes an issue.
Given that the work per task in naive fibonacci is essentially a single addition, it is effectively a pure-overhead benchmark for comparing runtimes.
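For context, here is a minimal sketch of the kind of naive task-parallel Fibonacci I have in mind (not one of the HPX quickstart examples verbatim; header names are assumed from the HPX 1.4-era layout). Every call spawns a new HPX thread whose own work is a single addition:

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>

#include <cstdint>
#include <iostream>

std::uint64_t fib(std::uint64_t n)
{
    if (n < 2)
        return n;

    // One new HPX thread (stack allocation + future instantiation) per call.
    hpx::future<std::uint64_t> lhs = hpx::async([n] { return fib(n - 1); });
    std::uint64_t rhs = fib(n - 2);

    return lhs.get() + rhs;
}

int main()
{
    std::cout << "fib(30) = " << fib(30) << "\n";
    return 0;
}
```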
For reference on fibonacci(40) on my machine:
- GCC OpenMP doesn't complete (due to the absence of load balancing and the use of a global queue protected by a single lock)
- LLVM OpenMP takes 2.1 sec
- Intel TBB takes either 600 msec or 1 sec, depending on whether we use a closure-based or a class-based approach
For reference, among the lesser-known C and C++ runtimes:
- Staccato takes 180 ms (Paper is paywalled unfortunately: Staccato: Cache-Aware Work-Stealing Task Scheduler for Shared-Memory System: http://dx.doi.org/10.1007/978-3-319-95171-3_8)
- Fibril takes 130 ms (Paper: A Practical Solution to the Cactus Stack Problem - http://chaoran.me/assets/pdf/ws-spaa16.pdf)
Both achieve those impressively low overheads by paying close attention to cache and memory locality.
My own WIP runtime currently also achieves 180 ms. However, since it is message/channel based rather than atomics/shared-memory based, and I want portability across OSes and CPU architectures, the main optimization in Fibril is not applicable: it relies on both Linux and the x86-64 calling convention (although it may be emulable via setjmp/longjmp).
My runtime is being developed here: https://github.com/mratsim/weave
Since HPX / ParalleX is, according to the documentation, "message-driven", I was interested in how it fares on the Fibonacci overhead front.
As HPX launches a new 'thread' for each invocation in the simple Fibonacci implementation (this includes a stack allocation and a full future instantiation), it will compare very badly against anything that does not do that. The overhead of one task/thread in HPX is about 1 microsecond per task; you can measure that easily using the future_overhead benchmark (https://github.com/STEllAR-GROUP/hpx/blob/master/tests/performance/local/future_overhead.cpp).
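For reference, a rough sketch of how one might measure that per-task overhead directly (a simplified stand-in for the linked future_overhead benchmark, not its actual code; header names assumed from the HPX 1.4-era layout):

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>    // hpx::wait_all

#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    constexpr std::uint64_t count = 500000;

    std::vector<hpx::future<void>> futures;
    futures.reserve(count);

    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i != count; ++i)
        futures.push_back(hpx::async([] {}));    // empty task: pure overhead
    hpx::wait_all(futures);
    auto stop = std::chrono::steady_clock::now();

    double us = std::chrono::duration<double, std::micro>(stop - start).count();
    std::cout << "per-task overhead: " << us / count << " microseconds\n";
    return 0;
}
```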
BTW, if all you do is to run fib(40) sequentially it will most likely not take longer than a couple of milliseconds - so really what's the point...
> BTW, if all you do is to run fib(40) sequentially it will most likely not take longer than a couple of milliseconds - so really what's the point...
The point is to get an idea of the worst-case scheduling overhead for tasks that are too fine-grained for the runtime system.
For some algorithms, especially recursive ones such as tree algorithms, a user may not, or cannot, size the tasks into sufficiently large chunks to amortize the runtime overhead: by the time it is known that we are at a leaf computation too small to be scheduled efficiently by a traditional runtime, it is too late.
This is exacerbated by the growing number of cores in current systems, which requires developers to create more tasks. A low-overhead runtime helps here by allowing developers to confidently create new tasks and expose parallelism, knowing that the runtime will handle them efficiently instead of choking.
Now, optimizing for such scenarios might not be one of HPX's priorities, just as GCC OpenMP also cannot handle fibonacci. But given how many of the quickstart examples are based on naive fibonacci, I think it would be valuable to address that overhead, or at least mention in the docs that fibonacci is an unrealistic workload and that HPX recommends a minimum task size of x cycles to amortize scheduling overhead (for example, TBB mentions 10000 cycles).
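As a hypothetical illustration of what such a recommendation amounts to in practice (the cutoff value below is arbitrary, not an HPX recommendation), the usual mitigation is to stop spawning below a grain-size threshold and fall back to the serial computation:

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>

#include <cstdint>

std::uint64_t fib_serial(std::uint64_t n)
{
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

std::uint64_t fib(std::uint64_t n)
{
    constexpr std::uint64_t cutoff = 20;    // arbitrary example threshold
    if (n < cutoff)
        return fib_serial(n);               // leaf work stays on this thread

    hpx::future<std::uint64_t> lhs = hpx::async([n] { return fib(n - 1); });
    return fib(n - 2) + lhs.get();
}
```

But as argued above, for irregular recursive workloads the right cutoff is not always knowable in advance.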
@mratsim Sure, I understand where you're coming from. The main question that interests me in the context of this discussion however would be where other runtimes 'cut corners' by operating under well defined constraints that allow for optimizations. We usually can't apply those optimizations to HPX because of its general purpose nature (at least not in the general case). OpenMP (before V5) for instance can rely on the fact that no task will exit the scope it was created in (which is the case for your Fibonacci example). This alone allows for significant optimizations. There are other implicit assumptions that people rely on while implementing their runtimes that give some (possibly non-trivial) benefit.
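To make that constraint concrete, here is a small sketch (illustrative only) of what 'a task exiting the scope it was created in' means: an HPX future can escape the enclosing function, so the runtime cannot assume a task's lifetime is bounded by its parent's stack frame, whereas a pre-5.0 OpenMP tasking model can rely on tasks completing before the enclosing region ends.

```cpp
#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>

// The returned future outlives this function's scope; the work it represents
// may still be running after make_work() has returned.
hpx::future<int> make_work()
{
    return hpx::async([] { return 42; });
}

int main()
{
    hpx::future<int> f = make_work();
    return f.get() == 42 ? 0 : 1;
}
```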
All of this said - we'd be more than happy to accept pull requests for special case scheduler implementations that a user could explicitly ask for in certain, well defined situations.
Yes, of course. For example, Staccato currently requires declaring the computation as a task and also indicating at compile time how many tasks an async computation can spawn (for fibonacci that is 1 if we only fork fib(n-1), or 2 if we also fork fib(n-2)). This is used to great effect to enable a StackAllocator. Another special case: Fibril requires the x86-64 calling convention plus assembly to implement resumable tasks on a cactus stack, and also mmap/madvise, which doesn't work on Windows (though that can probably be worked around).
And obviously both assume shared-memory.
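To give a flavor of the Staccato-style constraint (a generic illustration only, not Staccato's actual API): when the maximum number of children a task can spawn is a compile-time constant, the child task slots can live in a fixed-size buffer inside the parent's frame instead of being heap-allocated.

```cpp
#include <array>
#include <cstddef>

// Hypothetical task record; real runtimes store more (continuation, sync state, ...).
struct Task
{
    void (*fn)(void*);
    void* arg;
};

// A frame that can hold at most MaxChildren spawned tasks, with no heap allocation.
template <std::size_t MaxChildren>
struct TaskFrame
{
    std::array<Task, MaxChildren> children{};
    std::size_t spawned = 0;

    void spawn(void (*fn)(void*), void* arg)
    {
        children[spawned++] = Task{fn, arg};    // assumes spawned < MaxChildren
    }
};

// Naive Fibonacci forks at most one child per frame if only fib(n-1) is forked,
// or two if fib(n-2) is forked as well.
using FibFrame = TaskFrame<2>;
```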
Now regarding my own runtime, there are two key parts:
- Batching stolen tasks so that the overhead is amortized across many tasks (see the simplified sketch after this list). This is covered in the thesis "Embracing Explicit Communication in Work-Stealing Runtime Systems" by Andreas Prell (@aprell). A C implementation is at https://github.com/aprell/tasking-2.0; the important part is an adaptive switch between stealing one task and stealing half of the victim's tasks: https://github.com/aprell/tasking-2.0/blob/303ce3e09a7c118241b9d5482219e3e3b848212e/src/runtime.c#L791-L828.
- Avoiding allocations, either through a caching mechanism or by using alloca for nested futures until allocation is actually required because they are about to escape their scope (for example, when they are being stolen by another thread).
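Below is a deliberately simplified, single-lock sketch of the adaptive steal-one / steal-half idea (the real tasking-2.0 code linked above is channel-based and considerably more involved): a thief that keeps coming up empty switches to taking half of the victim's queue so the stealing overhead is amortized over many tasks.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <vector>

using Task = std::function<void()>;

struct WorkQueue
{
    std::deque<Task> tasks;
    std::mutex lock;

    // Take one task, or half of the queue when steal_half is set.
    std::vector<Task> steal(bool steal_half)
    {
        std::lock_guard<std::mutex> guard(lock);
        std::size_t n = steal_half ? (tasks.size() + 1) / 2
                                   : (tasks.empty() ? 0 : 1);
        std::vector<Task> loot;
        loot.reserve(n);
        for (std::size_t i = 0; i < n; ++i)
        {
            loot.push_back(std::move(tasks.front()));   // take from the 'old' end
            tasks.pop_front();
        }
        return loot;
    }
};

struct Thief
{
    std::size_t failed_steals = 0;

    std::vector<Task> try_steal(WorkQueue& victim)
    {
        // Heuristic: after a few unproductive attempts, switch to steal-half
        // so the per-steal overhead is amortized across many tasks.
        bool steal_half = failed_steals >= 2;
        std::vector<Task> loot = victim.steal(steal_half);
        failed_steals = loot.empty() ? failed_steals + 1 : 0;
        return loot;
    }
};
```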
I'm not aware of any other implicit tradeoffs in my own runtime, though the new Intel/Julia Partr scheduler raises some intriguing points at 12:30 of this presentation (https://youtu.be/YdiZa0Y3F3c?t=751) and in this forum thread (https://discourse.julialang.org/t/question-regarding-julias-planned-approach-to-composable-safe-and-easy-multithreading/17692/4).
Thanks for the links!
You're welcome. I actually have plenty more here, but I fear there is not enough time in a year to digest them all: https://github.com/numforge/laser/blob/master/research/runtime_threads_tasks_allocation_NUMA.md
I was re-reading an old comparison of multithreading runtimes from Sandia National Laboratories, a US national laboratory, at https://prod-ng.sandia.gov/techlib-noauth/access-control.cgi/2012/1210594.pdf
They had to exclude HPX (as well as GCC OpenMP and Nemesis QThreads) in order to analyze the other runtimes (Cilk, Intel OpenMP and Intel TBB). It would be nice to address that.

Let's work on improving these particular use cases in HPX first. Then I'd be more than happy to have those comparisons done again. That comparison was made a long time ago, I think we can do much better nowadays.