openpbs icon indicating copy to clipboard operation
openpbs copied to clipboard

prevent infinite reject logging if job is in prerun while deleting

Open vchlum opened this issue 2 years ago • 1 comments

Describe Bug or Feature

The server can get into an infinite loop logging messages like:

04/17/2024 09:41:34;0008;Server@pbs-m1;Job;216551.pbs-m1.metacentrum.cz;MOM rejected signal during delete (15001)

The server is not "stuck" because the loop goes through the work task. So it can be terminated. It logs the messages one after another... we got a 20+GB log size before noticing.

It can happen if a job does not reach the mom (e.g. due to an error on mom) and remains in PRERUN state while an attempt to delete the job is invoked.

Describe Your Change

  • A delay was added to resending the signal to a job. It only makes sense to wait a moment before the resend is issued. This should give the job some time to reach the mom. I suggest 5 seconds resend time.
  • The number of attempts is counted so an infinite loop can not happen. I suggest removing the job after 10 attempts anyway.

Link to Design Doc

Attach Test and Valgrind Logs/Output

This happened twice on our server but it is not easy to reproduce, so I made a fake situation with gdb to test the new code:

Thread 1 "pbs_server.bin" hit Breakpoint 1, post_delete_mom1 (pwt=0x5620c7121310) at req_delete.c:1592
1592		int tries = 0;
(gdb) n
1594		preq_sig = pwt->wt_parm1;
(gdb) 
1595		rc = preq_sig->rq_reply.brp_code;
(gdb) 
1597		if (rc && pwt->wt_aux && (rc < PBSE_) && (pwt->wt_aux > PBSE_))
(gdb) 
1599		auxcode = preq_sig->rq_reply.brp_auxcode;
(gdb) 
1600		preq_clt = preq_sig->rq_extra;
(gdb) 
1601		if (preq_clt == NULL) {
(gdb) 
1606		pjob = find_job(preq_sig->rq_ind.rq_signal.rq_jid);
(gdb) 
1607		if (pjob == NULL) {
(gdb) 
1614		release_req(pwt);
(gdb) 
1617		if (rc) {
(gdb) set rc = 15001
(gdb) call set_job_substate(pjob, 41)
(gdb) n
1619			sprintf(log_buffer, "MOM rejected signal during delete (%d)", rc);
(gdb) 
1621				  pjob->ji_qs.ji_jobid, log_buffer);
(gdb) 
1620			log_event(PBSEVENT_JOB, PBS_EVENTCLASS_JOB, LOG_INFO,
(gdb) 
1623			if (rc == PBSE_UNKSIG) {
(gdb) 
1638			} else if (rc == PBSE_UNKJOBID) {
(gdb) 
1639				if (check_job_substate(pjob, JOB_SUBSTATE_PRERUN)) {
(gdb) 
1644					if (preq_clt->rq_reply.brp_count < 10) {
(gdb) 
1645						preq_clt->rq_reply.brp_count++;
(gdb) 
1646						rc = delayed_issue_signal(pjob, sigtj, post_delete_mom1, preq_clt, 5);
(gdb) c
Continuing.

Thread 1 "pbs_server.bin" hit Breakpoint 1, post_delete_mom1 (pwt=0x5620c713e5c0) at req_delete.c:1592
1592		int tries = 0;
(gdb) n
1594		preq_sig = pwt->wt_parm1;
(gdb) 
1595		rc = preq_sig->rq_reply.brp_code;
(gdb) 
1597		if (rc && pwt->wt_aux && (rc < PBSE_) && (pwt->wt_aux > PBSE_))
(gdb) 
1599		auxcode = preq_sig->rq_reply.brp_auxcode;
(gdb) 
1600		preq_clt = preq_sig->rq_extra;
(gdb) 
1601		if (preq_clt == NULL) {
(gdb) 
1606		pjob = find_job(preq_sig->rq_ind.rq_signal.rq_jid);
(gdb) 
1607		if (pjob == NULL) {
(gdb) 
1614		release_req(pwt);
(gdb) 
1617		if (rc) {
(gdb) set rc = 15001
(gdb) call set_job_substate(pjob, 41)
(gdb) n
1619			sprintf(log_buffer, "MOM rejected signal during delete (%d)", rc);
(gdb) 
1621				  pjob->ji_qs.ji_jobid, log_buffer);
(gdb) 
1620			log_event(PBSEVENT_JOB, PBS_EVENTCLASS_JOB, LOG_INFO,
(gdb) 
1623			if (rc == PBSE_UNKSIG) {
(gdb) 
1638			} else if (rc == PBSE_UNKJOBID) {
(gdb) 
1639				if (check_job_substate(pjob, JOB_SUBSTATE_PRERUN)) {
(gdb) 
1644					if (preq_clt->rq_reply.brp_count < 10) {
(gdb) 
1645						preq_clt->rq_reply.brp_count++;
(gdb) 
1646						rc = delayed_issue_signal(pjob, sigtj, post_delete_mom1, preq_clt, 5);
(gdb) p preq_clt->rq_reply.brp_count
$1 = 2

vchlum avatar Apr 18 '24 07:04 vchlum

Thanks for adding the server attribute. Code looks good.

bayucan avatar May 28 '24 17:05 bayucan