openpbs
Crash on server restart if job_sort_formula is set
I'm currently testing potential job sorting policies on OpenPBS master, and the server crashes on restart whenever the job_sort_formula attribute is set. This happens whether the attribute is set on the server or on the scheduler. It does not appear to be related to the size of the formula: I can reproduce it with job_sort_formula = "eligible_time", with eligible_time enabled.
The error I see from the server log on startup is:
08/13/2021 20:29:55;0001;Server@pdwrichp-score-test-s1;Svr;Server@pdwrichp-score-test-s1;PBS server internal error (15011) in decode_attr_db, Action function failed for job_sort_formula attr, errn 15011
The crash occurs later in free_svrattrl, when memory that was never allocated gets freed (src/lib/Libattr/attr_func.c, line 416).
I am not seeing this with any of the custom resources I have set on the server while experimenting. So far I have only seen this with job_sort_formula.
I have been able to get the server to successfully restart by deleting the job_sort_formula key from the database by hand. When I looked at the attributes field in the database, it does appear to be showing the formula as I set it.
While the server is running, when job_sort_formula is set, it does appear to be otherwise picked up and used correctly and I have seen the formula evaluation messages in the logs when I have looked for them.
This crash does not happen if I build from the v20.0.0 tag and set job_sort_formula on the server and then restart PBS.
Paul, it looks like you are hitting two different problems, one causing the other. The first problem is that while recovering the formula from the DB, the server tries to run it through the Python interpreter. But during a restart, DB recovery happens before the Python interpreter is loaded, so the action function of job_sort_formula fails.
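To make the ordering problem concrete, here is a minimal sketch of one way an action function could tolerate being called before the interpreter exists: accept the value during recovery and validate it once the interpreter comes up. Every name here (interp_ready, pending_formula, formula_action, interp_started, validate_with_interp) is hypothetical and not taken from the OpenPBS sources, and this is not a claim about how the actual fix works.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag: set once the embedded interpreter has been started. */
static bool interp_ready = false;

/* Hypothetical slot remembering a formula whose validation was deferred. */
static const char *pending_formula = NULL;

/* Hypothetical stand-in for the interpreter-backed check.
 * Here it only rejects an empty formula. Returns 0 on success. */
int validate_with_interp(const char *formula)
{
    return formula[0] != '\0' ? 0 : -1;
}

/* Action function: if the interpreter is not up yet (as during DB
 * recovery), remember the value and report success instead of failing. */
int formula_action(const char *formula)
{
    assert(formula != NULL);
    if (!interp_ready) {
        pending_formula = formula;
        return 0;
    }
    return validate_with_interp(formula);
}

/* Called once the interpreter has started: run any deferred validation. */
int interp_started(void)
{
    interp_ready = true;
    if (pending_formula == NULL)
        return 0;
    const char *formula = pending_formula;
    pending_formula = NULL;
    return validate_with_interp(formula);
}
```

With this shape, recovery no longer depends on startup order, at the cost of reporting an invalid stored formula later than before.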
Now, the reason it crashes: while recovering from the DB, PBS knows how many attributes it has read from the database. If an action function fails for one of them, the server bails out partway through and tries to free the attributes it has decoded so far. The problem is that it frees as many attributes as it read from the DB, not as many as it actually decoded. Because the failure cut decoding short, those two counts differ, and the free runs past the decoded entries into memory that was never allocated.
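The count mismatch can be illustrated with a small sketch. It is not the OpenPBS code: struct attr, decode_attrs, free_attrs and reject_formula are hypothetical names, and the 15011 return value is only borrowed from the log line above for illustration. The point is that cleanup must use the number of entries actually decoded, not the number read.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a decoded attribute (not the real svrattrl). */
struct attr {
    char *value; /* heap-allocated once decoded, uninitialized otherwise */
};

/* Hypothetical action hook: returns 0 on success, nonzero on failure. */
typedef int (*action_fn)(const char *value);

/* Decode up to n_read raw values into out[], stopping at the first
 * failure. Returns how many entries were actually decoded. */
size_t decode_attrs(const char *const *raw, size_t n_read,
                    struct attr *out, action_fn action)
{
    assert(raw != NULL && out != NULL);
    size_t decoded = 0;
    for (size_t i = 0; i < n_read; i++) {
        if (action(raw[i]) != 0)
            break; /* bail out: out[i..n_read-1] stay uninitialized */
        size_t len = strlen(raw[i]) + 1;
        out[i].value = malloc(len);
        if (out[i].value == NULL)
            break;
        memcpy(out[i].value, raw[i], len);
        decoded++;
    }
    return decoded;
}

/* Free exactly n_decoded entries. Passing n_read here after a failed
 * decode would call free() on uninitialized pointers. */
size_t free_attrs(struct attr *attrs, size_t n_decoded)
{
    size_t freed = 0;
    for (size_t i = 0; i < n_decoded; i++) {
        free(attrs[i].value);
        attrs[i].value = NULL;
        freed++;
    }
    return freed;
}

/* Example action that fails for one attribute name. */
int reject_formula(const char *value)
{
    return strcmp(value, "job_sort_formula") == 0 ? 15011 : 0;
}
```

For example, decoding {"scheduling", "job_sort_formula", "default_queue"} with reject_formula stops after one entry, so the cleanup call should be free_attrs(out, 1) rather than free_attrs(out, 3).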
I've stumbled over this. As this is a new installation it was no problem to use a previous commit (from 2020-10-09 to be exact, as this one is known to work).
But this is a show stopper for everyone who needs job_sort_formula.
Issue #2530 is likely the same as this issue.
How can the formula be broken for almost a year? PBSPro obviously can't be using OpenPBS source as its exclusive base.
#2533 appears to fix this. I was able to get a clean restart in my test environment.