noel

Results 217 comments of noel

I think the parallel region is started higher up. I'm putting print statements around the code (seen below) and I see these arrays are allocated within a threaded region. But...

Well now (with master of June 17), I get a different error for the same test:
```
0: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
0:
0: Backtrace...
```

Both of the failing tests are passing in @bartgol's branch `bartgol/eamxx/use-only-scorpio-clib`

What about running more on pm-gpu?

Yes, it is certainly possible that the root cause is before this point. I did capture the stack of another process (MPI) and it showed it was also in the same...

I'm a big fan of adding output that gives us more info about what's happening, especially in places that typically take a long time.

I'm trying some cases with the above logging changes you suggested.

I've been running cases using checkouts that contain
```
4c1ce5cf31 2024-01-23 10:33:13 -0700 Merge pull request #2646 from E3SM-Project/bartgol/eamxx/spa-use-horiz-interp-remapper
```
that Luca suggested *might* impact hanging. So far, I have...

Just updating this issue: I've been doing some autotune testing with a Jan 25th checkout (and a few others) on Frontier and have yet to see the flavor of hang noted...

Note that we ran 1 year with ne256 on pm-gpu (both control and plus 4K). That was a March 27th checkout. I don't think we've changed the macmic in a while....