Improve startup times when hwloc is used
When we originally switched our default tasking layer from fifo to qthreads, we saw a pretty big increase in startup time. This can be seen in our perf graphs around 2014/08/25.
Startup time isn't super important to us since we mostly care about the performance of longer-running, highly parallel applications, but it does hurt our performance for some of the Computer Language Benchmarks Game codes, and it makes repeatedly running sporadically failing tests much slower.
Note that this only seems to be an issue for topo={hwloc, hwloc_v2, binders}. We see startup times comparable to fifo if we use topo=sys. Here are some timings collected on a dual 12-core Haswell machine (our "chapcs" configuration):
```c
#include <assert.h>
#include "qthread/qthread.h"

/* Measure bare qthreads startup: initialize the library and exit. */
int main(void) {
    assert(qthread_initialize() == 0);
    return 0;
}
```
| topology | startup time (s) |
|---|---|
| hwloc | ~0.030 |
| hwloc_v2 | ~0.025 |
| binders | ~0.023 |
| sys | ~0.003 |
It's possible that hwloc takes a while to start up and query the system, or that it takes a while to pin threads to cores. We'd like to have an explanation for why topo=hwloc is slower than topo=sys, even if the root cause is something beyond our control.
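If it helps narrow that down, here's a minimal sketch (mine, not from qthreads; assumes a POSIX system with clock_gettime and linking against hwloc) that times discovery and thread binding separately:

```c
#include <stdio.h>
#include <time.h>
#include <hwloc.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    hwloc_topology_t topology;

    double t0 = now_sec();
    hwloc_topology_init(&topology);
    double t1 = now_sec();
    hwloc_topology_load(topology);   /* full system discovery */
    double t2 = now_sec();

    /* Bind the current thread to the first PU, roughly what a tasking
     * layer does once per worker thread. */
    hwloc_obj_t pu = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PU, 0);
    if (pu)
        hwloc_set_cpubind(topology, pu->cpuset, HWLOC_CPUBIND_THREAD);
    double t3 = now_sec();

    printf("init: %.6f s  load: %.6f s  bind: %.6f s\n",
           t1 - t0, t2 - t1, t3 - t2);
    hwloc_topology_destroy(topology);
    return 0;
}
```

If the load time dominates and the bind time is negligible, that would point at discovery rather than pinning.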
https://chapel.atlassian.net/browse/CHAPEL-278 has some more context from the Chapel side of things and includes timings of no-op Chapel programs, which have a little more overhead than just qthreads/hwloc startup time.
Figuring out the cause of the difference is a medium-priority item for us. The priority of resolving the performance difference (if that's possible) will probably depend on how much effort would be required.
I thought that you guys were looking at using the new binders topo layer that @stelleg had worked on -- not so?
Yeah, that's a good point. We do want to look at using binders, although I haven't had a chance to yet. FWIW, using binders with QT_CPUBIND unset has roughly the same timings as hwloc_v2 (added that to the table). I'll also update the table after I get a chance to run with QT_CPUBIND set.
It would still be useful for us to know where the time is being spent in either the hwloc or binders layer (is it all in hwloc init and topology gathering, in the cpuset manipulation, or somewhere else?).
Also, do you know if anybody else is using the binders layer?
I was looking into this a little bit, and it appears that almost all of the startup time is spent in hwloc_topology_load.
I found a couple of other threads from HPX-5 and Open MPI indicating that this is a known problem, but that there is a fixed cost to gathering information about the system. The hwloc docs have some recommendations for limiting the cost: https://www.open-mpi.org/projects/hwloc/doc/v2.4.0/a00364.php
Their main suggestion is basically to tell hwloc to avoid gathering information for constructs you don't care about. The linked doc page suggests something like:
""" However, using hwloc comes with an overhead since a topology must be loaded before gathering information and binding tasks or memory. This overhead may be reduced by filtering useless information out of the topology. For instance the following code builds a topology that may only contain Cores, hardware threads (PUs), and NUMA nodes (Packages, Caches and Group objects are ignored). """
```c
hwloc_topology_t topology;
hwloc_topology_init(&topology);
hwloc_topology_ignore_type(topology, HWLOC_OBJ_PACKAGE);
hwloc_topology_ignore_type(topology, HWLOC_OBJ_CACHE);
hwloc_topology_ignore_type(topology, HWLOC_OBJ_GROUP);
hwloc_topology_load(topology);
```
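If I'm reading that FAQ page right, it also suggests skipping rediscovery entirely by loading the topology from an XML file exported once (e.g. with lstopo), either via hwloc_topology_set_xml or the HWLOC_XMLFILE environment variable. A rough sketch, where "topo.xml" is just a placeholder path:

```c
#include <stdio.h>
#include <hwloc.h>

int main(void) {
    hwloc_topology_t topology;
    hwloc_topology_init(&topology);
    /* Try the cached topology first; if it can't be read, hwloc falls
     * back to normal discovery when we call load below. */
    if (hwloc_topology_set_xml(topology, "topo.xml") < 0)
        fprintf(stderr, "no cached topology, doing full discovery\n");
    hwloc_topology_load(topology);
    /* ... query/bind as usual ... */
    hwloc_topology_destroy(topology);
    return 0;
}
```

That wouldn't help first-run latency, but it might be interesting for the "run a flaky test hundreds of times" case.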
FYI hwloc 1.11.6 had some nice speedups for topology discovery:
- "Make the Linux discovery about twice faster, especially on the CPU side, by trying to avoid sysfs file accesses as much as possible."
https://mail-archive.com/[email protected]/msg00095.html
That resulted in a ~20% startup time improvement for us.
This is still of interest to us, though I don't know how much we can do about it.
I think the next things for me to try are upgrading to hwloc 2.X to see if there are any startup time improvements, and to see if there's any impact from using hwloc_topology_ignore_type.
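One note for the 2.X experiment: hwloc_topology_ignore_type was removed in hwloc 2.0 in favor of type filters, so the equivalent of the doc snippet above would presumably look something like this (untested sketch):

```c
#include <hwloc.h>

int main(void) {
    hwloc_topology_t topology;
    hwloc_topology_init(&topology);
    /* Drop Packages, all cache levels, and Groups before discovery;
     * Cores, PUs, and NUMA nodes are kept. */
    hwloc_topology_set_type_filter(topology, HWLOC_OBJ_PACKAGE,
                                   HWLOC_TYPE_FILTER_KEEP_NONE);
    hwloc_topology_set_cache_types_filter(topology,
                                          HWLOC_TYPE_FILTER_KEEP_NONE);
    hwloc_topology_set_type_filter(topology, HWLOC_OBJ_GROUP,
                                   HWLOC_TYPE_FILTER_KEEP_NONE);
    hwloc_topology_load(topology);
    hwloc_topology_destroy(topology);
    return 0;
}
```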