NorESM
Do not use nodes b1373-b1375,b1382 on betzy
I have suddenly started to experience unexpected crashes on betzy. I repeatedly get the following type of traceback when the job runs on nodes b1373-b1375,b1382:
208: [b1374:545909:0:545909] ud_ep.c:278 Fatal: UD endpoint 0xaff0a40 to : unhandled timeout error
208: ==== backtrace (tid: 545909) ====
208: 0 0x000000000005e810 uct_ud_ep_deferred_timeout_handler() .....
When I excluded these nodes from the submission, the model ran. I have notified Sigma2 about this.
For noresm2_5_alpha07, the easiest way to exclude nodes from a job is to edit your $SRCROOT/ccsm_config/machines/betzy/env_batch.xml and add the --exclude line marked below:
<directives>
<directive> --ntasks={{ total_tasks }}</directive>
<directive> --export=ALL</directive>
<directive> --switches=1</directive>
<directive> --exclude=b1373,b1374,b1375,b1382</directive> <!-- add this line -->
</directives>
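
If the list of bad nodes changes, the directive line can be generated instead of typed by hand. The following is only a minimal sketch: `expand_nodes` and `exclude_directive` are hypothetical helper names (not part of NorESM or CIME), and it assumes node names are a `b` prefix followed by a number, written in the same shorthand as in this post.

```python
def expand_nodes(spec, prefix="b"):
    """Expand shorthand like 'b1373-b1375,b1382' into individual node names.

    Assumption: every node name is `prefix` followed by an integer.
    """
    nodes = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            # Accept both 'b1373-b1375' and 'b1373-1375'.
            if end.startswith(prefix):
                end = end[len(prefix):]
            first = int(start[len(prefix):])
            last = int(end)
            nodes.extend(f"{prefix}{n}" for n in range(first, last + 1))
        else:
            nodes.append(part)
    return nodes


def exclude_directive(spec):
    """Build the <directive> line to paste into env_batch.xml."""
    return f"<directive> --exclude={','.join(expand_nodes(spec))}</directive>"


if __name__ == "__main__":
    print(exclude_directive("b1373-b1375,b1382"))
    # prints: <directive> --exclude=b1373,b1374,b1375,b1382</directive>
```

The printed line still has to be pasted into the `<directives>` block by hand; the sketch does not edit env_batch.xml itself.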