pha4pgsql throttle_handle_load: High CPU load detected导致pgsql

throttle_handle_load: High CPU load detected导致pgsql_monitor超时pg重启

Open ChenHuajun opened this issue 8 years ago • 0 comments

执行一个delete从500w的表里删除13w条记录，发现Master挂了。查看日志，是RA monitor超时，认为资源失败。

Pacemaker 在发生Timed Out后重启postgres，迁移阈值是3，可以重启3次。 /var/log/messages

Nov 18 13:22:05 node1 expgsql(pgsql)[7774]: INFO: Stopping PostgreSQL on demote.
Nov 18 13:22:05 node1 expgsql(pgsql)[7774]: INFO: server shutting down
Nov 18 13:22:11 node1 expgsql(pgsql)[7774]: INFO: PostgreSQL is down
Nov 18 13:22:11 node1 expgsql(pgsql)[7774]: INFO: Changing pgsql-status on node1 : PRI->STOP.
Nov 18 13:22:11 node1 expgsql(pgsql)[8609]: INFO: PostgreSQL is already stopped.
Nov 18 13:22:12 node1 expgsql(pgsql)[8714]: INFO: Set all nodes into async mode.
Nov 18 13:22:12 node1 expgsql(pgsql)[8714]: INFO: server starting
Nov 18 13:22:12 node1 expgsql(pgsql)[8714]: INFO: PostgreSQL start command sent.
Nov 18 13:22:12 node1 expgsql(pgsql)[8714]: INFO: PostgreSQL is down
Nov 18 13:22:13 node1 expgsql(pgsql)[8714]: INFO: PostgreSQL is started.

corosync.log中发现在发生超时前，系统负载很高。 /var/log/cluster/corosync.log

Nov 18 13:20:55 [2353] node1       crmd:     info: throttle_handle_load:	Moderate CPU load detected: 12.060000
Nov 18 13:20:55 [2353] node1       crmd:     info: throttle_send_command:	New throttle mode: 0010 (was 0001)
Nov 18 13:21:25 [2353] node1       crmd:   notice: throttle_handle_load:	High CPU load detected: 16.379999
Nov 18 13:21:25 [2353] node1       crmd:     info: throttle_send_command:	New throttle mode: 0100 (was 0010)
Nov 18 13:21:44 [2350] node1       lrmd:  warning: child_timeout_callback:	pgsql_monitor_3000 process (PID 4822) timed out
Nov 18 13:21:44 [2350] node1       lrmd:  warning: operation_finished:	pgsql_monitor_3000:4822 - timed out after 60000ms
Nov 18 13:21:44 [2353] node1       crmd:    error: process_lrm_event:	Operation pgsql_monitor_3000: Timed Out (node=node1, call=837, timeout=60000ms)
Nov 18 13:21:44 [2348] node1        cib:     info: cib_process_request:	Forwarding cib_modify operation for section status to master (origin=local/crmd/462)

Pacemaker输出系统负载高告警，可能由于IO wait高 http://clusterlabs.org/pipermail/users/2015-May/000518.html

有人遇到类似的问题，处理办法就是增大monitor的超时时间。 https://bugs.launchpad.net/fuel/+bug/1464131 https://review.openstack.org/#/c/191715/1/deployment/puppet/pacemaker_wrappers/manifests/rabbitmq.pp

在线修改的方法如下： pcs resource update pgsql op monitor interval=4s timeout=180s on-fail=restart pcs resource update pgsql op monitor role=Master timeout=180s on-fail=restart interval=3s

Nov 18 '16 14:11 ChenHuajun

pha4pgsql pha4pgsql copied to clipboard

throttle_handle_load: High CPU load detected导致pgsql_monitor超时pg重启

pha4pgsql
pha4pgsql copied to clipboard