chef-server
chef-server copied to clipboard
Bookshelf database cleanup tasks will never complete if they fall behind
Chef Infra Server 13.2 + Chef Backend 2.1, RHEL 7.8
bookshelf has two database cleanup tasks, do_delete_cleanup + do_upload_cleanup: https://github.com/chef/chef-server/blob/ac11df3d7c88b164a16beccdd18c12d403e87967/src/bookshelf/src/bksw_cleanup_task.erl#L79-L97
these tasks run every 7 and 19 minutes, respectively: https://github.com/chef/chef-server/blob/ac11df3d7c88b164a16beccdd18c12d403e87967/src/bookshelf/src/bksw_cleanup_task.erl#L52-L53
sqerl defaults to a 5 second postgres statement_timeout: https://github.com/chef/sqerl/blob/master/src/sqerl_client.erl#L68
If they take more than 5 seconds, bookshelf crashes:
2020-08-06 00:00:10 =SUPERVISOR REPORT====
Supervisor: {local,bksw_sup}
Context: child_terminated
Reason: {timeout,{gen_server,call,[bksw_cleanup_task,delete_cleanup]}}
Offender: [{pid,<0.5948.224>},{id,bksw_cleanup_task},{mfargs,{bksw_cleanup_task,start_link,[]}},{restart_type,permanent},{shutdown,brutal_kill},{child_type,worker}]
This will continue forever until the cleanup is done manually, e.g.
echo "select from purge_expired(420000)\;" | su -l chef_pgsql -c 'psql bookshelf'
Other than the database size growing over time, we aren't sure of the side-affects of this behavior at this time.
Aha! Link: https://chef.aha.io/features/SH-3062
This will result in 500s, 503s for cookbook related tasks while the bookshelf process is being restarted. Uploads/downloads both are affected. In a busy system, its disruptive.
Do we actually see the bookshelf process itself restarting? I ask because the supervisor configuration in bookshelf is:
https://github.com/chef/chef-server/blob/master/src/bookshelf/src/bksw_sup.erl#L44-L47
Which should mean that the supervisor won't shutdown until there have been 1000 restarts inside a given hour interval.
Some quick thoughts.
I no longer remember how we choose that cleanup interval, but a first order mitigation would be to call it more often to avoid getting into a bad state.
A better fix might be to add a LIMIT XXX to the select here: https://github.com/chef/chef-server/blob/ac11df3d7c/src/bookshelf/schema/deploy/smart_delete.sql#L38, where XXX is chosen to be able to complete before the 5Sec timeout.
Better yet would be to make that LIMIT a parameter of the stored procedure, and then alter the do_upload_cleanup routine to recurse until we clean up fewer than that LIMIT.
I'd avoid globally changing the statement_timeout in sqerl, because I think that will have broader downstream effects.
@ssd
Do we actually see the bookshelf process itself restarting? I ask because the supervisor configuration in bookshelf is:
https://github.com/chef/chef-server/blob/master/src/bookshelf/src/bksw_sup.erl#L44-L47
Which should mean that the supervisor won't shutdown until there have been 1000 restarts inside a given hour interval.
Yes, I have seen the bookshelf process restarting. I did not notice the bookshelf supervisor restarting. No results from this, in a time period where the child was crashing lots, so seems like the supervisor itself remained up the whole time
grep Context: */*/var/log/opscode/bookshelf/crash.log | grep -v child_terminated