noel

Results 217 comments of noel

On frontier, using `tcclevenger/simulations/cess-production-cherry-pick-merges` (with team barrier), I'm able to run ne256 longer. Currently at `model date = 20191113` `/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/tcessr-sep15/cess-v1-cntl.ne256pg2_ne256pg2.F2010-SCREAMv1.tcessr-sep15.n0096t8x111661.tb.noSC.newy`

Why would we want to source `/global/cfs/cdirs/e3sm/eamxx-ml/python_venv/3.9.13/screamML/bin/activate`? Can this be done only for certain experiments? It makes me uneasy to change the environment.

Just tracking job info.
```
The two jobs that crashed for me:
jobid 1448414 failed with first error message on rank 14424 or node frontier08656
jobid 1449627 failed with first...
```

Ha. I was able to reproduce the issue using only 1 node with ne30. I requested to run on the "bad node". `/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/cess-sep28/cess-v2-cntl.ne30pg2_ne30pg2.F2010-SCREAMv1.cess-sep28.n0001t8x111661.withfrontier08656`

I've only received an automated response email and see that the ticket OLCFHELP-14845 was created. In general, requesting to avoid a certain node can only increase queue wait time, but I don't...

Luca B, Chris T, and I have been trying to debug this. Unable to find a reproducer at lower res than ne1024. Have tried a few other things without success...

Note that for a recent cess run (using the cess branch) on frontier, we forgot to include the restart force hack for some yaml files and the error seen in...

We used it for the Cess runs on frontier. I am not sure if there may have been a change to master that might have addressed it. Was used in these...

The above was with `bartgol/eamxx/share-horiz-remap-data` checked out on Jan 24. Trying the same two setups with a scream master of Jan 23, I get the same diffs -- i.e. it diverges at the same place...

I had incorrectly assumed that a previous checkout was BFB. Oddly, there are indeed non-BFB diffs in the hash prints, but there are only 3 of them -- which seems...
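
For counting non-BFB diffs like the ones mentioned above, a minimal sketch of a line-by-line comparison of hash prints from two runs. The function name `differing_lines` and the sample hash lines are hypothetical and made up for illustration; the real hash-print format in the logs may differ, so the lines would first need to be extracted from each log (e.g. with `grep`).

```python
from itertools import zip_longest


def differing_lines(run_a, run_b):
    """Return (index, line_a, line_b) for every position where the runs disagree.

    If one run has fewer lines, the missing side is reported as None.
    """
    diffs = []
    for i, (a, b) in enumerate(zip_longest(run_a, run_b)):
        if a != b:
            diffs.append((i, a, b))
    return diffs


# Hypothetical hash lines, invented for this example (not real model output).
run_a = ["step 0 hash aaaa", "step 1 hash bbbb", "step 2 hash cccc"]
run_b = ["step 0 hash aaaa", "step 1 hash bbbb", "step 2 hash dddd"]

diffs = differing_lines(run_a, run_b)
first_divergence = diffs[0][0] if diffs else None
print(f"{len(diffs)} differing line(s), first at index {first_divergence}")
```

An empty result would mean the two sets of hash prints match line for line; the index of the first entry shows where the runs start to diverge.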