Figure out a story for gasnet-ibv multi-rail
At a high level, some InfiniBand systems (including Summit) have multiple rails, and getting peak bandwidth requires communicating over all of them. By default GASNet will only use a single rail, but there's a way to opt in to multi-rail support. This support can introduce other overheads, so it's not necessarily something we want to enable by default either. Currently, though, enabling it requires changing GASNet configure-time options, which isn't very friendly to Chapel users.
Some more details from the GASNet ibv README:
Terminology:
This document will use "NIC" (short for Network Interface Card) to refer to the physical object installed in a host and "connector" to refer to the external connections from the NIC to a network.
The term "HCA" (short for Host Channel Adapter) will be used to refer to a device as enumerated by ibv_get_device_list() or the command-line utility ibv_devices. The HCA is the device driver's logical representation of the NIC, but there is not always a one-to-one correspondence, as described next.
When a NIC has multiple connectors, the driver may present these either as a single HCA with multiple "ports" or as multiple single-port HCAs. Additionally, some systems will present more than one HCA per connector. This is typically done on systems where the NIC is connected to multiple I/O buses. On a compute node of the Summit system at OLCF, there are two external network cable connectors on a single NIC which is connected internally to two I/O buses. The driver presents four HCAs, one for each combination of external connector and internal I/O bus. So, for a single NIC with two connectors there are at least three ways the system may present the same resources: with 1, 2 or 4 HCAs.
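To see how many HCAs a given node presents, the ibv_devices output can be filtered down to device names. The following is a minimal sketch, not something from the GASNet README: the function names list_hcas and count_hcas are hypothetical, and it assumes each device row consists of a name followed by a 16-hex-digit node GUID.

```shell
# Hypothetical helpers (not part of GASNet or Chapel): read `ibv_devices`
# output on stdin and report the HCAs it lists.
# Assumption: each device row is "<name> <16-hex-digit node GUID>"; header
# and separator rows do not match that shape and are skipped.
list_hcas() {
  awk 'length($2) == 16 && $2 ~ /^[0-9a-fA-F]+$/ { print $1 }'
}

# Count the HCAs reported on stdin.
count_hcas() {
  list_hcas | wc -l | tr -d ' '
}

# Typical use on a compute node (requires the libibverbs utilities):
#   ibv_devices | count_hcas
```

On a system like the Summit node described above, this would be expected to report four HCAs for the single physical NIC.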
Multi-rail:
By default, ibv-conduit will use only the first active port on the first active InfiniBand Host Channel Adapter (HCA). However, if more than one HCA port is enabled for use, ibv-conduit will stripe communications over them.
The use of multiple ports or multiple adapters will yield increases in both bandwidth (good) and software overhead (bad). How the resulting trade off works for a given application may be hard to predict.
Our own nightly 16-node-cs-hdr machine has dual rail for the CascadeLake partition. The last time I ran correctness testing with GASNet-EX 2021.3.0 I saw sporadic failures. Using fenced puts (see GASNET_USE_FENCED_PUTS in the gasnet README) did not help the situation.
I think we need to understand the cause of these correctness regressions and evaluate the performance impact of using multiple rails. Once that is done we'll need to figure out a more friendly user-facing way to have Chapel users opt in to using multiple rails.
I should also note that by default GASNet will warn if it detects multiple rails that the user hasn't opted in to using. Given the current correctness issues, we requested a way to quiet this warning in https://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4246 and are planning to quiet the warning for our users until this issue is addressed.
I am trying to pick up where @ronawho left off on this issue.
I have successfully built Chapel with CHPL_GASNET_MORE_CFG_OPTIONS='--with-ibv-max-hcas=2 --enable-trace' and run ep and mg tests using both rails on a pair of multi-rail nodes (as verified via the --enable-trace support). There were no "sporadic failures" in this very limited testing.
Some information on the original failures would be appreciated.
That's encouraging, thanks for the report, Paul!
@ronawho, can you let us know what you remember here? For example, did you run a full CHPL_COMM=gasnet paratest and see a few failures across the full suite? Were they different every time, or not necessarily reproducible when running a second time?
I think it was segfaults for some tests in the release/examples or runtime/configMatters dirs (the default set of tests I would run for multi-locale testing). I'm surprised I didn't list which tests failed or include the full log. That I didn't capture that makes me suspect simple tests (like maybe even hello) were failing, but I really don't remember.
This all predated co-locales FWIW, so it was 1 process trying to use multiple rails.
Thanks, @ronawho. I am initially only testing without co-locales, fwiw.
Just to let people know my current thinking here:
If we find that there are inherent problems with using multi-rail with a single locale per node, and we don't believe those issues will be easy to fix, or that the fixes would hurt performance more than we'd like, I'd be strongly in favor of shelving that mode to focus on the co-locale support for multi-rail (that is, a rail per co-locale). That is, I'd imagine having the single-locale-per-node mode simply not use multiple rails, at least in the short-term. Basically, I think co-locales are the right way to handle this long-term—particularly given their other benefits—and my interest in a co-locale-free solution was simply based on the hope that it would "just work" and benefit users before being reminded of the history here.
On the other hand, if single-locale-per-node can use, and benefit from, multiple rails per node, that'd be great to know. Even then, I'd still be interested in pursuing the co-locale approach in order to compare the two approaches, and also to understand whether giving each co-locale its own rail vs. having them share both rails would be better for our benchmarks' performance (also because I consider co-locales to be the future in general).
I was able to return to this yesterday (Mon, Nov 4).
I have overcome previous problems getting make test-venv to complete, and was able to run start_test in the test/release/examples directory, using a Chapel build with the following explicit settings:
export CHPL_LLVM=bundled
export CHPL_RE=bundled
export CHPL_GMP=bundled
export CHPL_TARGET_CPU=native
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=ibv
export CHPL_GASNET_SEGMENT=large
export CHPL_GASNET_MORE_CFG_OPTIONS='--with-ibv-max-hcas=2 --enable-pshm'
I have access to only a pair of dual-HCA nodes (gomez03 and gomez04), and many of the tests want to use 4 locales. This required some "extra" settings, including the --enable-pshm above (avoids warnings which lead to spurious failures to match test output). The following environment variables at runtime were used to disable shared-memory communication and to avoid ssh-spawner warnings about requests for 3 or 4 nodes when only 2 are available:
export GASNET_SUPERNODE_MAXSIZE=1
export GASNET_SSH_SERVERS=gomez03,gomez04,gomez03,gomez04,gomez03,gomez04,gomez03,gomez04
export GASNET_SSH_KEEPDUP=1
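The repeated host list above can also be generated rather than typed out. This is a minimal sketch in POSIX sh; make_ssh_servers is a hypothetical helper, not part of Chapel or GASNet, and it simply cycles through the given hosts until it has one entry per requested slot.

```shell
# Hypothetical helper (not part of Chapel or GASNet).
# Usage: make_ssh_servers COUNT HOST [HOST...]
# Prints a comma-separated list of COUNT entries, cycling through the hosts.
make_ssh_servers() {
  count=$1
  shift
  # Without at least one host the loop below could never make progress.
  [ "$#" -gt 0 ] || return 1
  list=""
  i=0
  while [ "$i" -lt "$count" ]; do
    for host in "$@"; do
      [ "$i" -lt "$count" ] || break
      list="${list:+$list,}$host"
      i=$((i + 1))
    done
  done
  printf '%s\n' "$list"
}

# Reproduces the settings above: 8 slots across the two dual-HCA nodes.
export GASNET_SUPERNODE_MAXSIZE=1
export GASNET_SSH_SERVERS="$(make_ssh_servers 8 gomez03 gomez04)"
export GASNET_SSH_KEEPDUP=1
```

The count of 8 matches the list used in this experiment; a different count would only be appropriate if the tests request a different number of locales.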
Note that I did not make proper use of co-locales (yet) in this experiment.
The testing on dual-HCA nodes ran to completion without issues:
[Done with tests - 241105.123815]
[Log file: /home/ac.phargrove/Chapel/SRC/test/Logs/ac.phargrove.linux64.log ]
[Skipped 2 tests with .stdin input]
[Test Summary - 241105.123815]
[Warning: Could not find chpldoc, skipping test release/examples/primers/chpldoc.doc]
[Summary: #Successes = 143 | #Failures = 0 | #Futures = 0 | #Warnings = 1 ]
[Summary: #Passing Suppressions = 0 | #Passing Futures = 0 ]
[END]
Summary:
In my recent experience use of CHPL_GASNET_MORE_CFG_OPTIONS='--with-ibv-max-hcas=2' was sufficient to enable use of both HCAs (verified in a build with tracing enabled), and the release/examples tests ran without incident in my non-co-locale testing.
Tests with CHPL_RT_LOCALES_PER_NODE=2 will likely follow as soon as time allows.
However, that will (as in this set of tests) use both HCAs in both locales, which may not be the eventual "best practice".
Those passed with the addition of 5 "Error matching program output" failures, corresponding to diffs like the following for tests using 5 or more locales:
> warning: The node has more locales (3) than co-locales (2).
> Considering the node oversubscribed.
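The arithmetic behind this warning can be sketched as follows, assuming locales are spread as evenly as possible over the available nodes (an assumption about placement, not something confirmed above). The function names are hypothetical. With 2 nodes and CHPL_RT_LOCALES_PER_NODE=2, any test asking for 5 or more locales puts at least 3 on one node.

```shell
# Hypothetical helpers illustrating the oversubscription check.
# Assumption: locales are distributed as evenly as possible across nodes.

# max_locales_per_node NUM_LOCALES NUM_NODES
# Prints ceil(NUM_LOCALES / NUM_NODES).
max_locales_per_node() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# is_oversubscribed NUM_LOCALES NUM_NODES COLOCALES_PER_NODE
# Succeeds when some node must host more locales than co-locales.
is_oversubscribed() {
  [ "$(max_locales_per_node "$1" "$2")" -gt "$3" ]
}

# The case from the diff above: 5 locales, 2 nodes, 2 co-locales per node.
if is_oversubscribed 5 2 2; then
  echo "5 locales on 2 nodes: $(max_locales_per_node 5 2) per node exceeds 2 co-locales"
fi
```

Under the same assumption, tests using 4 or fewer locales fit within 2 co-locales per node, which is consistent with only the 5-or-more-locale tests showing the diff.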
I want to make clear that while I have been unable to reproduce the failures which prompted this issue, I cannot point to any change in code (in GASNet-EX or Chapel runtime) or difference(s) in the system which can account for this difference. So, I do not think it is safe to assume that failures are not possible based on just my limited testing.
Thanks for these updates, Paul!