Thomas Quinn

26 comments by Thomas Quinn

Following the suggestion in #2635, I tried running ChaNGa with the master branch of ucx. With dwf1b running on 2 nodes/4 processes, I get the failure: ``` [1589667781.056885] [c161-001:28740:0]...

> @trquinn: How can I get access to the dwf1b benchmark? Is it the same as dwf1.6144 as listed in https://github.com/N-BodyShop/changa/wiki/ChaNGa-Benchmarks?

Correct: that benchmark can be downloaded from Google Drive.

> I tried the `h148.cosmo50PLK.6144g3HbwK1BH.param` benchmark on 64 nodes with 2 processes/node, built on charm that was built using ucx master, and I see the crash which is the same...

I checked on expected memory use: when running on a single SMP process, this benchmark uses 16.3GB. Using netlrts with 4 SMP processes, the benchmark uses 5.3GB/process (i.e., ~22GB total).

They just upgraded the OFED libraries (and the system-installed UCX) on Frontera. We should see if that makes a difference first. I'm in meetings until 14:30 PDT all next...

Note that a similar issue is reported in the UCX repository: https://github.com/openucx/ucx/issues/5291

I've done a little more investigation on Frontera, using the master branches of ucx and charm (as of Aug. 18), and the dwf1b benchmark, running 8 processors on 4 nodes....

I just tried with UCX v1.9.0 and Charm v6.11.0-beta. The issue still occurs.

Reproduction: so far I've only seen this on a 128x20-core run with an 8GB input dataset. I'll try on something smaller. Other symptoms include getting zeros for arguments in...