openpbs
openpbs copied to clipboard
prevent infinite reject logging if job is in prerun while deleting
Describe Bug or Feature
The server can get into an infinite loop logging messages like:
04/17/2024 09:41:34;0008;Server@pbs-m1;Job;216551.pbs-m1.metacentrum.cz;MOM rejected signal during delete (15001)
The server is not "stuck" because the loop goes through the work task. So it can be terminated. It logs the messages one after another... we got a 20+GB log size before noticing.
It can happen if a job does not reach the mom (e.g. due to an error on mom) and remains in PRERUN state while an attempt to delete the job is invoked.
Describe Your Change
- A delay was added to resending the signal to a job. It only makes sense to wait a moment before the resend is issued. This should give the job some time to reach the mom. I suggest 5 seconds resend time.
- The number of attempts is counted so an infinite loop can not happen. I suggest removing the job after 10 attempts anyway.
Link to Design Doc
Attach Test and Valgrind Logs/Output
This happened twice on our server but it is not easy to reproduce, so I made a fake situation with gdb to test the new code:
Thread 1 "pbs_server.bin" hit Breakpoint 1, post_delete_mom1 (pwt=0x5620c7121310) at req_delete.c:1592
1592 int tries = 0;
(gdb) n
1594 preq_sig = pwt->wt_parm1;
(gdb)
1595 rc = preq_sig->rq_reply.brp_code;
(gdb)
1597 if (rc && pwt->wt_aux && (rc < PBSE_) && (pwt->wt_aux > PBSE_))
(gdb)
1599 auxcode = preq_sig->rq_reply.brp_auxcode;
(gdb)
1600 preq_clt = preq_sig->rq_extra;
(gdb)
1601 if (preq_clt == NULL) {
(gdb)
1606 pjob = find_job(preq_sig->rq_ind.rq_signal.rq_jid);
(gdb)
1607 if (pjob == NULL) {
(gdb)
1614 release_req(pwt);
(gdb)
1617 if (rc) {
(gdb) set rc = 15001
(gdb) call set_job_substate(pjob, 41)
(gdb) n
1619 sprintf(log_buffer, "MOM rejected signal during delete (%d)", rc);
(gdb)
1621 pjob->ji_qs.ji_jobid, log_buffer);
(gdb)
1620 log_event(PBSEVENT_JOB, PBS_EVENTCLASS_JOB, LOG_INFO,
(gdb)
1623 if (rc == PBSE_UNKSIG) {
(gdb)
1638 } else if (rc == PBSE_UNKJOBID) {
(gdb)
1639 if (check_job_substate(pjob, JOB_SUBSTATE_PRERUN)) {
(gdb)
1644 if (preq_clt->rq_reply.brp_count < 10) {
(gdb)
1645 preq_clt->rq_reply.brp_count++;
(gdb)
1646 rc = delayed_issue_signal(pjob, sigtj, post_delete_mom1, preq_clt, 5);
(gdb) c
Continuing.
Thread 1 "pbs_server.bin" hit Breakpoint 1, post_delete_mom1 (pwt=0x5620c713e5c0) at req_delete.c:1592
1592 int tries = 0;
(gdb) n
1594 preq_sig = pwt->wt_parm1;
(gdb)
1595 rc = preq_sig->rq_reply.brp_code;
(gdb)
1597 if (rc && pwt->wt_aux && (rc < PBSE_) && (pwt->wt_aux > PBSE_))
(gdb)
1599 auxcode = preq_sig->rq_reply.brp_auxcode;
(gdb)
1600 preq_clt = preq_sig->rq_extra;
(gdb)
1601 if (preq_clt == NULL) {
(gdb)
1606 pjob = find_job(preq_sig->rq_ind.rq_signal.rq_jid);
(gdb)
1607 if (pjob == NULL) {
(gdb)
1614 release_req(pwt);
(gdb)
1617 if (rc) {
(gdb) set rc = 15001
(gdb) call set_job_substate(pjob, 41)
(gdb) n
1619 sprintf(log_buffer, "MOM rejected signal during delete (%d)", rc);
(gdb)
1621 pjob->ji_qs.ji_jobid, log_buffer);
(gdb)
1620 log_event(PBSEVENT_JOB, PBS_EVENTCLASS_JOB, LOG_INFO,
(gdb)
1623 if (rc == PBSE_UNKSIG) {
(gdb)
1638 } else if (rc == PBSE_UNKJOBID) {
(gdb)
1639 if (check_job_substate(pjob, JOB_SUBSTATE_PRERUN)) {
(gdb)
1644 if (preq_clt->rq_reply.brp_count < 10) {
(gdb)
1645 preq_clt->rq_reply.brp_count++;
(gdb)
1646 rc = delayed_issue_signal(pjob, sigtj, post_delete_mom1, preq_clt, 5);
(gdb) p preq_clt->rq_reply.brp_count
$1 = 2
Thanks for adding the server attribute. Code looks good.