Nitin Bhat

Results: 22 comments by Nitin Bhat

I get past the registration error when I run with the nonsmp version. However, the run crashes after step 3553.875. ``` 33237 Step: 3553.875000 Time: 0.286602 Rungs 3 to 4....

@trquinn I was able to reproduce the memory registration error that you were seeing on 2 nodes/4 processes while running the dwf1b benchmark. ``` [Orb3dLB_notopo] sorting *************************** Orb3dLB_notopo stats: maxObjLoad...

@brminich: Yes, I was running that on Frontera. Sure.

1) Build charm (ChaNGa target) using `./build ChaNGa ucx-linux-x86_64 smp --enable-error-checking --suffix=debug --basedir= -j24 -g -O0`
2) Download ChaNGa from https://github.com/N-BodyShop/changa...

@brminich Yes, it crashes every time I run on Frontera. It takes about 14 minutes to crash. How many nodes did you run it on? 4 nodes? On trying with...

Okay, I think you can run it on 4 nodes (with 28 cores each) to better suit the 2 Frontera nodes (with 56 cores each). Yes, in some runs, I...

@brminich Yes, let's schedule a debugging session sometime next week, if that works for you. @trquinn Okay, I can check whether that is making a difference.

@brminich: Do you have any insights as to what might be happening here (or in the linked issue on the UCX repo, openucx/ucx#5291)?

I also saw it on UCX builds, IIRC, on Frontera. (That's the reason I tested this on my local machine and saw the incorrect values with `mpi-smp` on...

> tag @nitbhat @tarudoodi for review

Sorry for the delay in getting back on this. Reviewed, the changes look good. Do the CI tests exercise the multi-NIC code? Otherwise, it'll be good...

> Does this fix the issue @nitbhat ?

I'll try this today on the AWS cluster and let you know.