Thomas Quinn
Following the suggestion in #2635, I tried running ChaNGa with the master branch of ucx. With dwf1b running on 2 nodes/4 processes, I get the failure: ``` [1589667781.056885] [c161-001:28740:0]...
> @trquinn: How can I get access to the dwf1b benchmark? Is it the same as dwf1.6144 as listed in https://github.com/N-BodyShop/changa/wiki/ChaNGa-Benchmarks?

Correct: that benchmark can be downloaded from Google Drive.
> I tried the `h148.cosmo50PLK.6144g3HbwK1BH.param` benchmark on 64 nodes with 2 processes/node, built on charm that was built using ucx master, and I see the crash which is the same...
I checked on expected memory use: when running on a single SMP process, this benchmark uses 16.3GB. Using netlrts with 4 SMP processes, the benchmark uses 5.3GB/process (i.e., ~22GB total).
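
For reference, here is a quick sketch of the arithmetic behind those numbers. It only uses the rounded figures quoted above, and the helper name `memory_overhead` is made up for illustration. With the rounded per-process value the total works out to roughly 21 GB (in line with the ~22GB quoted), i.e. about 30% more than the single-process run.

```python
# Rounded figures from the comment above, in GB.
SINGLE_PROCESS_GB = 16.3   # one SMP process
PER_PROCESS_GB = 5.3       # netlrts, per process
NUM_PROCESSES = 4          # netlrts SMP processes


def memory_overhead(single_gb, per_process_gb, num_processes):
    """Return (total_gb, extra_gb, extra_fraction) for the multi-process run.

    Hypothetical helper, not part of ChaNGa or Charm++.
    """
    total_gb = per_process_gb * num_processes
    extra_gb = total_gb - single_gb
    return total_gb, extra_gb, extra_gb / single_gb


total, extra, frac = memory_overhead(SINGLE_PROCESS_GB, PER_PROCESS_GB, NUM_PROCESSES)
print(f"total: {total:.1f} GB, extra: {extra:.1f} GB ({frac:.0%} over single-process)")
```

This says nothing about where the extra memory goes; it is only a way to keep the baseline in mind when comparing against the UCX runs.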
They just upgraded the OFED libraries (and the system-installed UCX) on Frontera. We should see if that makes a difference first. I'm in meetings until 14:30 PDT all next...
Note that a similar issue is reported in the UCX repository: https://github.com/openucx/ucx/issues/5291
I've done a little more investigation on Frontera, using the master branches of ucx and charm (as of Aug. 18), and the dwf1b benchmark, running 8 processors on 4 nodes....
Any chance this will be fixed in 6.11?
I just tried with UCX v1.9.0 and Charm v6.11.0-beta. The issue still occurs.
Reproduction: so far I've only seen this on a 128x20 core run with an 8GB input dataset. I'll try on something smaller. Other symptoms include getting zeros for arguments in...