chef-server icon indicating copy to clipboard operation
chef-server copied to clipboard

Bookshelf database cleanup tasks will never complete if they fall behind

Open btm opened this issue 5 years ago • 4 comments

Chef Infra Server 13.2 + Chef Backend 2.1, RHEL 7.8

bookshelf has two database cleanup tasks, do_delete_cleanup + do_upload_cleanup: https://github.com/chef/chef-server/blob/ac11df3d7c88b164a16beccdd18c12d403e87967/src/bookshelf/src/bksw_cleanup_task.erl#L79-L97

these tasks run every 7 and 19 minutes, respectively: https://github.com/chef/chef-server/blob/ac11df3d7c88b164a16beccdd18c12d403e87967/src/bookshelf/src/bksw_cleanup_task.erl#L52-L53

sqerl defaults to a 5 second postgres statement_timeout: https://github.com/chef/sqerl/blob/master/src/sqerl_client.erl#L68

If they take more than 5 seconds, bookshelf crashes:

2020-08-06 00:00:10 =SUPERVISOR REPORT====
     Supervisor: {local,bksw_sup}
     Context:    child_terminated
     Reason:     {timeout,{gen_server,call,[bksw_cleanup_task,delete_cleanup]}}
     Offender:   [{pid,<0.5948.224>},{id,bksw_cleanup_task},{mfargs,{bksw_cleanup_task,start_link,[]}},{restart_type,permanent},{shutdown,brutal_kill},{child_type,worker}]

This will continue forever until the cleanup is done manually, e.g.

echo "select from purge_expired(420000)\;" | su -l chef_pgsql -c 'psql bookshelf'

Other than the database size growing over time, we aren't sure of the side-affects of this behavior at this time.

Aha! Link: https://chef.aha.io/features/SH-3062

btm avatar Aug 10 '20 21:08 btm

This will result in 500s, 503s for cookbook related tasks while the bookshelf process is being restarted. Uploads/downloads both are affected. In a busy system, its disruptive.

sean-horn avatar Sep 01 '20 01:09 sean-horn

Do we actually see the bookshelf process itself restarting? I ask because the supervisor configuration in bookshelf is:

https://github.com/chef/chef-server/blob/master/src/bookshelf/src/bksw_sup.erl#L44-L47

Which should mean that the supervisor won't shutdown until there have been 1000 restarts inside a given hour interval.

stevendanna avatar Sep 01 '20 10:09 stevendanna

Some quick thoughts.

I no longer remember how we choose that cleanup interval, but a first order mitigation would be to call it more often to avoid getting into a bad state.

A better fix might be to add a LIMIT XXX to the select here: https://github.com/chef/chef-server/blob/ac11df3d7c/src/bookshelf/schema/deploy/smart_delete.sql#L38, where XXX is chosen to be able to complete before the 5Sec timeout.

Better yet would be to make that LIMIT a parameter of the stored procedure, and then alter the do_upload_cleanup routine to recurse until we clean up fewer than that LIMIT.

I'd avoid globally changing the statement_timeout in sqerl, because I think that will have broader downstream effects.

markan avatar Sep 04 '20 21:09 markan

@ssd

Do we actually see the bookshelf process itself restarting? I ask because the supervisor configuration in bookshelf is:

https://github.com/chef/chef-server/blob/master/src/bookshelf/src/bksw_sup.erl#L44-L47

Which should mean that the supervisor won't shutdown until there have been 1000 restarts inside a given hour interval.

Yes, I have seen the bookshelf process restarting. I did not notice the bookshelf supervisor restarting. No results from this, in a time period where the child was crashing lots, so seems like the supervisor itself remained up the whole time

grep Context: */*/var/log/opscode/bookshelf/crash.log  | grep -v child_terminated

sean-horn avatar Sep 11 '20 16:09 sean-horn