Error of e3sm_timing_stats file for case with NINST > 1

Open jiamenglai opened this issue 1 month ago • 5 comments

I am running the model with multiple instances. The case runs and generates output without problems, but it does not stop correctly. The error concerns the file name of e3sm_timing_stats (see below). I have checked that e3sm_timing_stats files exist, but there are several of them, each with the instance number in its file name (e.g., e3sm_timing_0001_stats...). This bug does not affect the model output, but it prevents the case from resubmitting the following jobs. How can I fix this? Thanks.

2025-11-19 14:48:41 PRE_RUN_CHECK HAS FINISHED
run command is srun  --label  -n 10 -N 1 -c 2  --cpu-bind=cores   -m plane=128 /global/homes/j/jiamengl/FATEScase_v3.1.0-alpha-4304/GuyaFlux/GuyaFlux_2PT.inven.allom.d2bl.fixbug.2025-11-19/bld/e3sm.exe   >> e3sm.log.$LID 2>&1  
2025-11-19 14:48:44 SAVE_PRERUN_PROVENANCE BEGINS HERE
Setting resource.RLIMIT_STACK to -1 from (-1, -1)
2025-11-19 14:48:47 SAVE_PRERUN_PROVENANCE HAS FINISHED
2025-11-19 14:48:47 MODEL EXECUTION BEGINS HERE
2025-11-19 15:15:05 MODEL EXECUTION HAS FINISHED
2025-11-19 15:15:05 POST_RUN_CHECK BEGINS HERE
2025-11-19 15:15:05 POST_RUN_CHECK HAS FINISHED
2025-11-19 15:15:05 RUN_MODEL HAS FINISHED
2025-11-19 15:15:05 GET_TIMING BEGINS HERE
2025-11-19 15:15:06 GET_TIMING HAS FINISHED
2025-11-19 15:15:26 SAVE_POSTRUN_PROVENANCE BEGINS HERE
ERROR: /global/homes/j/jiamengl/FATEScase_v3.1.0-alpha-4304/GuyaFlux/GuyaFlux_2PT.inven.allom.d2bl.fixbug.2025-11-19/timing/e3sm_timing_stats.45402368.251119-144825 does not exists

jiamenglai avatar Nov 26 '25 20:11 jiamenglai

If you only want a quick workaround that does not solve the underlying problem, you can try setting SAVE_TIMING to FALSE (for example, ./xmlchange SAVE_TIMING=FALSE in the case directory) so the run does not even bother with the SAVE_POSTRUN_PROVENANCE part.

If you want to actually fix this, you likely need to propagate the correct NINST details into the corresponding code for SAVE_POSTRUN_PROVENANCE.
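A minimal sketch of what an instance-aware lookup could look like. This is not CIME's actual provenance code. The helper name find_timing_stats is hypothetical, and it assumes that the per-instance files reported above (e3sm_timing_0001_stats...) carry the same run ID as the single-instance name in the error message.

```python
from pathlib import Path


def find_timing_stats(timing_dir, lid, model="e3sm"):
    """Return the timing-stats files for one run.

    Hypothetical helper, not part of CIME. Assumptions:
    - a single-instance run writes '<model>_timing_stats.<lid>'
    - a multi-instance run writes one file per instance whose name starts
      with '<model>_timing_<instance>_stats' and contains the same <lid>
    """
    timing_dir = Path(timing_dir)

    # Single-instance name, as expected by the error message in the report.
    single = timing_dir / f"{model}_timing_stats.{lid}"
    if single.exists():
        return [single]

    # Otherwise fall back to per-instance files for the same run ID.
    candidates = timing_dir.glob(f"{model}_timing_*_stats*")
    return sorted(p for p in candidates if lid in p.name)
```

A caller would then archive every returned file instead of failing when the single-instance name is absent. An empty list would still indicate that no timing statistics were written for that run.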

mahf708 avatar Nov 26 '25 21:11 mahf708

How are you setting NINST, by the way? Is this happening in the unidriver or the multidriver multi-instance setup?

mahf708 avatar Nov 26 '25 21:11 mahf708

How are you setting NINST, by the way? Is this happening in the unidriver or the multidriver multi-instance setup?

I set NINST when running ./create_newcase by adding '--ninst=25 --multi-driver'.

I set SAVE_TIMING to FALSE as suggested. The case finished without error and a case was resubmitted. However, the resubmitted case failed:

ERROR: Command: '/global/common/software/nersc/pe/conda-envs/24.1.0/python-3.11/nersc-python/bin/xmllint --xinclude --noout --schema /global/u2/j/jiamengl/E3SM/cime/CIME/data/config/xml_schemas/env_entry_id.xsd /global/u2/j/jiamengl/FATEScase_v3.1.0-alpha-4304/GuyaFlux/GuyaFlux_2PT.gm.ocs.inven.allom.lru.dynamic.2025-11-25/env_run.xml' failed with error '/global/u2/j/jiamengl/FATEScase_v3.1.0-alpha-4304/GuyaFlux/GuyaFlux_2PT.gm.ocs.inven.allom.lru.dynamic.2025-11-25/env_run.xml:1: parser error : Document is empty

^' from dir '/global/u2/j/jiamengl/FATEScase_v3.1.0-alpha-4304/GuyaFlux/GuyaFlux_2PT.gm.ocs.inven.allom.lru.dynamic.2025-11-25'

I don't understand this error, as env_run.xml did exist in my case directory.
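The "Document is empty" message means the parser found no content in the file at the moment it was read, which is different from the file not existing. One possibility (an assumption, not something the log confirms) is that env_run.xml was empty or partially written when the resubmitted job read it. A minimal sketch for checking a file's state, where check_xml_file is a hypothetical helper and the check covers only well-formedness, not the env_entry_id.xsd schema:

```python
import os
import xml.etree.ElementTree as ET


def check_xml_file(path):
    """Return a short status string for an XML file.

    Hypothetical helper for debugging. It distinguishes a missing file,
    a zero-length file, and a file that is not well-formed XML. It does
    not perform schema validation the way 'xmllint --schema' does.
    """
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    try:
        ET.parse(path)
    except ET.ParseError as exc:
        return f"parse error: {exc}"
    return "ok"
```

Calling check_xml_file("env_run.xml") from the case directory right after a failure would show whether the file is still empty at that point or has since been rewritten.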

jiamenglai avatar Nov 27 '25 00:11 jiamenglai

I have never seen that xmllint error. Sometimes odd things happen because of the conda env, but I can't be certain that this is the issue. To debug, you can try module load cray-python (instead of module load python).

Does running the command itself return the error? In your shell, try:

/global/common/software/nersc/pe/conda-envs/24.1.0/python-3.11/nersc-python/bin/xmllint --xinclude --noout --schema /global/u2/j/jiamengl/E3SM/cime/CIME/data/config/xml_schemas/env_entry_id.xsd /global/u2/j/jiamengl/FATEScase_v3.1.0-alpha-4304/GuyaFlux/GuyaFlux_2PT.gm.ocs.inven.allom.lru.dynamic.2025-11-25/env_run.xml

If it errors, try this as a check (note that I changed the path of xmllint to /usr/bin/xmllint):

/usr/bin/xmllint --xinclude --noout --schema /global/u2/j/jiamengl/E3SM/cime/CIME/data/config/xml_schemas/env_entry_id.xsd /global/u2/j/jiamengl/FATEScase_v3.1.0-alpha-4304/GuyaFlux/GuyaFlux_2PT.gm.ocs.inven.allom.lru.dynamic.2025-11-25/env_run.xml

If the second one works, then you know that the Python conda env you have loaded (nersc-python) is the issue.

(I can't run the above commands because I don't have permission to access your directories.)

mahf708 avatar Nov 27 '25 03:11 mahf708

Thanks for the suggestion. I can successfully run the command: /global/common/software/nersc/pe/conda-envs/24.1.0/python-3.11/nersc-python/bin/xmllint --xinclude --noout --schema /global/u2/j/jiamengl/E3SM/cime/CIME/data/config/xml_schemas/env_entry_id.xsd /global/u2/j/jiamengl/FATEScase_v3.1.0-alpha-4304/GuyaFlux/GuyaFlux_2PT.gm.ocs.inven.allom.lru.dynamic.2025-11-25/env_run.xml

Another bug: sometimes (not every time) when I run with NINST > 1, I get the error: Cannot open file 'memory.3.86400.log': File exists. Usually, when I delete the memory file and resubmit the case, it works. Is there another way to fix this?
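Until the underlying problem is fixed, the manual workaround described above (delete the memory file, then resubmit) can be scripted. The sketch below only automates that workaround and does not address the cause. The helper name remove_stale_memory_logs is hypothetical, the glob pattern is inferred from the single file name in the error message, and the directory to scan is whichever one holds the memory.*.log files.

```python
from pathlib import Path


def remove_stale_memory_logs(run_dir, dry_run=True):
    """List, and optionally delete, memory.*.log files in run_dir.

    Hypothetical helper. The pattern 'memory.*.log' is an assumption
    based on the name 'memory.3.86400.log' in the error message.
    With dry_run=True nothing is deleted; the matches are only returned.
    """
    stale = sorted(Path(run_dir).glob("memory.*.log"))
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

Running it once with dry_run=True shows what would be removed. Passing dry_run=False before resubmitting deletes those files.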

jiamenglai avatar Dec 03 '25 19:12 jiamenglai

@jiamenglai The memory part should be fixed now, and the issue in this PR will also be fixed soon. Would you mind testing and letting us know if the issues persist for you? Thanks for reporting and helping us improve the model.

mahf708 avatar Dec 12 '25 18:12 mahf708