
Do not use nodes b1373-b1375,b1382 on betzy

Open mvertens opened this issue 1 year ago • 0 comments

I have suddenly started to experience unexpected crashes on betzy. I repeatedly get the following type of traceback when a job uses the nodes b1373-b1375,b1382:

  208: [b1374:545909:0:545909] ud_ep.c:278 Fatal: UD endpoint 0xaff0a40 to : unhandled timeout error
  208: ==== backtrace (tid: 545909) ====
  208: 0 0x000000000005e810 uct_ud_ep_deferred_timeout_handler() .....

When I excluded these nodes from the submission, the model ran. I have notified sigma2 about this. For noresm2_5_alpha07, the easiest way to exclude nodes from a job is to edit your $SRCROOT/ccsm_config/machines/betzy/env_batch.xml and add the --exclude line marked below:

  <directives>
    <directive> --ntasks={{ total_tasks }}</directive>
    <directive> --export=ALL</directive>
    <directive> --switches=1</directive>
    <directive> --exclude=b1373,b1374,b1375,b1382</directive> <!-- add this line -->
  </directives>
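
If the list of bad nodes changes, the directive has to be rewritten by hand. The sketch below expands the shorthand used in the issue title (for example b1373-b1375,b1382) into the explicit comma-separated list used in the directive above. The helper names `expand_nodes` and `exclude_directive` are hypothetical and are not part of NorESM or CIME. The shorthand format is an assumption taken from this issue, not Slurm's own hostlist syntax.

```python
import re


def expand_nodes(spec):
    """Expand a shorthand such as 'b1373-b1375,b1382' into explicit node names.

    Assumes each node name is a letter prefix followed by digits, and that
    ranges are written as '<first>-<last>' (the notation used in this issue).
    """
    nodes = []
    for part in spec.split(","):
        part = part.strip()
        if "-" not in part:
            nodes.append(part)
            continue
        first, last = part.split("-", 1)
        m_first = re.fullmatch(r"([A-Za-z]+)(\d+)", first)
        m_last = re.fullmatch(r"([A-Za-z]*)(\d+)", last)
        if not m_first or not m_last:
            raise ValueError(f"cannot parse node range: {part!r}")
        prefix = m_first.group(1)
        width = len(m_first.group(2))  # keep any zero padding of the first name
        for n in range(int(m_first.group(2)), int(m_last.group(2)) + 1):
            nodes.append(f"{prefix}{n:0{width}d}")
    return nodes


def exclude_directive(spec):
    """Build the <directive> line for env_batch.xml from a node shorthand."""
    return "<directive> --exclude=" + ",".join(expand_nodes(spec)) + "</directive>"


if __name__ == "__main__":
    print(exclude_directive("b1373-b1375,b1382"))
    # prints: <directive> --exclude=b1373,b1374,b1375,b1382</directive>
```

The printed line can be pasted into the directives block of env_batch.xml in place of the hand-written one.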

mvertens, Oct 30 '24 14:10