pha4pgsql
throttle_handle_load: High CPU load detected causes pgsql_monitor to time out and PostgreSQL to restart
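A DELETE that removed 130,000 rows from a table of 5 million rows brought the Master down. The logs show that the RA monitor operation timed out, so the resource was considered failed.
After the Timed Out error, Pacemaker restarts postgres. The migration threshold is 3, so it can be restarted 3 times.
/var/log/messages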
Nov 18 13:22:05 node1 expgsql(pgsql)[7774]: INFO: Stopping PostgreSQL on demote.
Nov 18 13:22:05 node1 expgsql(pgsql)[7774]: INFO: server shutting down
Nov 18 13:22:11 node1 expgsql(pgsql)[7774]: INFO: PostgreSQL is down
Nov 18 13:22:11 node1 expgsql(pgsql)[7774]: INFO: Changing pgsql-status on node1 : PRI->STOP.
Nov 18 13:22:11 node1 expgsql(pgsql)[8609]: INFO: PostgreSQL is already stopped.
Nov 18 13:22:12 node1 expgsql(pgsql)[8714]: INFO: Set all nodes into async mode.
Nov 18 13:22:12 node1 expgsql(pgsql)[8714]: INFO: server starting
Nov 18 13:22:12 node1 expgsql(pgsql)[8714]: INFO: PostgreSQL start command sent.
Nov 18 13:22:12 node1 expgsql(pgsql)[8714]: INFO: PostgreSQL is down
Nov 18 13:22:13 node1 expgsql(pgsql)[8714]: INFO: PostgreSQL is started.
corosync.log shows that the system load was very high before the timeout occurred.
/var/log/cluster/corosync.log
Nov 18 13:20:55 [2353] node1 crmd: info: throttle_handle_load: Moderate CPU load detected: 12.060000
Nov 18 13:20:55 [2353] node1 crmd: info: throttle_send_command: New throttle mode: 0010 (was 0001)
Nov 18 13:21:25 [2353] node1 crmd: notice: throttle_handle_load: High CPU load detected: 16.379999
Nov 18 13:21:25 [2353] node1 crmd: info: throttle_send_command: New throttle mode: 0100 (was 0010)
Nov 18 13:21:44 [2350] node1 lrmd: warning: child_timeout_callback: pgsql_monitor_3000 process (PID 4822) timed out
Nov 18 13:21:44 [2350] node1 lrmd: warning: operation_finished: pgsql_monitor_3000:4822 - timed out after 60000ms
Nov 18 13:21:44 [2353] node1 crmd: error: process_lrm_event: Operation pgsql_monitor_3000: Timed Out (node=node1, call=837, timeout=60000ms)
Nov 18 13:21:44 [2348] node1 cib: info: cib_process_request: Forwarding cib_modify operation for section status to master (origin=local/crmd/462)
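To pull the relevant lines out of a busy corosync.log, a filter along these lines may help. This is only a minimal sketch: the function name scan_pacemaker_log is hypothetical, and the patterns are taken from the messages quoted in this report, so other Pacemaker versions may word them differently.

```shell
# Hypothetical helper (not part of pha4pgsql or Pacemaker).
# Prints throttle messages and operation timeouts from a Pacemaker log.
# The default path is the one used in this report; pass another path as $1.
scan_pacemaker_log() {
    log="${1:-/var/log/cluster/corosync.log}"
    # throttle_  : throttle_handle_load / throttle_send_command lines from crmd
    # timed out  : lrmd child_timeout_callback / operation_finished lines
    # Timed Out  : crmd process_lrm_event lines
    grep -E 'throttle_|timed out|Timed Out' "$log"
}
```

Running scan_pacemaker_log /var/log/cluster/corosync.log prints the matching lines in their original order, which makes it easier to see whether the load rose before the monitor timed out.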
Pacemaker reports high system load, which may be caused by high I/O wait: http://clusterlabs.org/pipermail/users/2015-May/000518.html
Others have run into a similar problem, and their fix was to increase the monitor timeout: https://bugs.launchpad.net/fuel/+bug/1464131 https://review.openstack.org/#/c/191715/1/deployment/puppet/pacemaker_wrappers/manifests/rabbitmq.pp
The timeout can be changed online as follows:
pcs resource update pgsql op monitor interval=4s timeout=180s on-fail=restart
pcs resource update pgsql op monitor role=Master timeout=180s on-fail=restart interval=3s
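After the update it may be worth confirming that the new timeouts are in place and checking how many failures have already been counted against the migration threshold. The sketch below assumes pcs 0.9.x, where "pcs resource show" prints a resource's definition (newer pcs releases use "pcs resource config" instead). The function name check_monitor_settings is hypothetical.

```shell
# Hypothetical helper (not part of pha4pgsql): review the monitor settings
# and the recorded failures for a resource after changing its timeouts.
check_monitor_settings() {
    res="${1:-pgsql}"
    # Show the resource definition, including its "op monitor" entries.
    # Assumes pcs 0.9.x; newer releases use "pcs resource config".
    pcs resource show "$res"
    # Show the failures that count toward migration-threshold.
    pcs resource failcount show "$res"
}
```

Call it as check_monitor_settings pgsql on a cluster node. If old failures are still recorded, pcs resource cleanup pgsql clears them.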