
Feature: support debugging of timed out resource agents

Open dclavender opened this issue 3 years ago • 14 comments

This is a supportability/debug feature request to help with diagnosing unexpected RA (resource agent) timeouts. Cases involving RA timeouts are one of our most commonly reported problems, and unfortunately there is no easy way to troubleshoot them, especially if they are not readily repeatable/reproducible. This changeset adds functionality to pacemaker so that a configurable external program/script can be executed just before pacemaker terminates the RA monitor hierarchy. The sample debugRA.sh script gathers ps information along with user/kernel stack traces of all the processes in the hierarchy, in the hope that this will give clues as to why the RA did not complete in the allotted time. Being an external script, it is easily customised to perform other data-gathering steps.

Pacemaker Alerts were considered and tested; however, they are not synchronised with the termination of the RA process hierarchy, so by the time they run the problem processes have already disappeared.

The changes were tested on SLES 15 SP1 with pacemaker-2.0.1, using a modified version of the ip(1) command (employed by the IPaddr2 RA). The modification permits a user-injectable delay, which forces the monitor operation to overrun so that the timeout handling is exercised.
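
For reference, here is a minimal sketch of what such a debug script might look like, reconstructed from the sample output in the Test Results below. It is an illustration only, not the exact debugRA.sh from this changeset; the log path, the gdb invocation and the /proc usage are assumptions. The script is assumed to receive the pgid of the timed-out RA hierarchy as its first argument, as described above.

#!/bin/sh
# Illustrative sketch only -- not the debugRA.sh shipped with this change.
# $1 is assumed to be the pid/pgid of the timed-out RA hierarchy.
RA_PGID=$1
LOG=/root/ra-debug_$(date +%Y%m%d_%H%M%S).log
{
    echo "Started: $(date +%s.%N)"
    echo "Passed pid $RA_PGID"
    # Snapshot of the whole process tree (same columns as the output below)
    ps -o uid,ruid,pid,ppid,pgid,sess,state,tty,psr,pcpu,etimes,wchan,args -He
    # User and kernel stacks for every member of the RA's process group
    for pid in $(ps -eo pid=,pgid= | awk -v pg="$RA_PGID" '$2 == pg { print $1 }'); do
        echo "======================"
        echo "Getting user stack for pid $pid $(ps -o comm= -p "$pid")"
        gdb -q -batch -p "$pid" -ex bt 2>/dev/null
        echo "Getting kernel stack for pid $pid"
        cat /proc/"$pid"/stack 2>/dev/null
    done
    echo "Finished: $(date +%s.%N)"
} > "$LOG" 2>&1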

Test Results

B4609009D:~ # ll -rt ra-d*
-rw-r----- 1 root root  95816 Apr 26 11:36 ra-debug_20210426_113614.log
-rw-r----- 1 root root 104251 Apr 26 12:03 ra-debug_20210426_120314.log
-rw-r----- 1 root root 104251 Apr 27 05:11 ra-debug_20210427_051100.log
B4609009D:~ #


B4609009D:~ # cat ra-deb*
Started: 1619451374.146883177
Passed pid 5710
   UID  RUID   PID  PPID  PGID  SESS S TT       PSR %CPU ELAPSED WCHAN  COMMAND
     0     0     2     0     0     0 S ?          7  0.0  347070 -      [kthreadd]
     0     0     4     2     0     0 S ?          0  0.0  347070 -        [kworker/0:0H]
     0     0     7     2     0     0 S ?          0  0.0  347070 -        [mm_percpu_wq]
........

     0     0 69500     1 69500 69500 S ?         16  0.3   10472 SyS_ep   corosync
     0     0 69506     1 69506 69506 S ?         62  0.0   10471 SyS_po   /usr/sbin/pacemakerd -f
    90    90 69507 69506 69507 69507 S ?         19  0.0   10471 SyS_po     /usr/lib/pacemaker/pacemaker-based
     0     0 69508 69506 69508 69508 S ?         12  0.0   10471 SyS_po     /usr/lib/pacemaker/pacemaker-fenced
     0     0 69509 69506 69509 69509 S ?         58  0.0   10471 -          /usr/lib/pacemaker/pacemaker-execd
     0     0  5710 69509  5710 69509 S ?         46  0.0      20 -            /bin/sh /usr/lib/ocf/resource.d/heartbeat/IPaddr2 monitor
     0     0  5722  5710  5710 69509 S ?         65  0.0      20 -              /bin/sh /usr/lib/ocf/resource.d/heartbeat/IPaddr2 monitor
     0     0  5725  5722  5710 69509 S ?          1  0.0      20 -                /bin/sh /usr/lib/ocf/resource.d/heartbeat/IPaddr2 monitor
     0     0  5726  5725  5710 69509 S ?         59  0.0      20 -                  ip -o -f inet route list match 192.168.1.5/24 scope link
     0     0  5727  5725  5710 69509 S ?         61  0.0      20 -                  awk BEGIN{best=0} /\// { mask=$1; sub(".*/", "", mask); if( int(mask)>=best ) { best=int(mask); best_ln=$0; } } END{print best_ln}
     0     0  5887 69509 69509 69509 S ?         15  0.0       0 -            /bin/sh /root/debugRA.sh 5710
     0     0  5890  5887 69509 69509 R ?         60  300       0 -              ps -o uid,ruid,pid,ppid,pgid,sess,state,tty,psr,pcpu,etimes,wchan,args -He
    90    90 69510 69506 69510 69510 S ?         44  0.0   10471 SyS_po     /usr/lib/pacemaker/pacemaker-attrd
    90    90 69511 69506 69511 69511 S ?         74  0.0   10471 SyS_po     /usr/lib/pacemaker/pacemaker-schedulerd
    90    90 69512 69506 69512 69512 S ?         48  0.0   10471 SyS_po     /usr/lib/pacemaker/pacemaker-controld
======================
Getting user stack for pid 5710 IPaddr2
#0  0x00007f5d935e6c61 in read () from /lib64/libc.so.6
#1  0x000055f7d7c599d5 in zread ()
#2  0x000055f7d7c472e1 in command_substitute ()
#3  0x000055f7d7c3f1c8 in ?? ()
#4  0x000055f7d7c41867 in ?? ()
#5  0x000055f7d7c409b4 in ?? ()
#6  0x000055f7d7c2c7a5 in ?? ()
#7  0x000055f7d7c2b3bd in execute_command_internal ()

.........

======================
Getting user stack for pid 5726 ip
#0  0x00007fbda12710c1 in nanosleep () from /lib64/libc.so.6
#1  0x00007fbda1270ffa in sleep () from /lib64/libc.so.6
#2  0x0000000000415085 in iproute_list_flush_or_save (argc=<optimized out>, argc@entry=4, argv=<optimized out>, argv@entry=0x7ffe90b38d68, action=action@entry=0) at iproute.c:1481
#3  0x000000000041590b in do_iproute (argc=5, argv=0x7ffe90b38d60) at iproute.c:2008
#4  0x0000000000409f4c in do_cmd (argv0=0x7ffe90b399e3 "route", argc=6, argv=0x7ffe90b38d58) at ip.c:115
#5  0x0000000000409a15 in main (argc=7, argv=0x7ffe90b38d50) at ip.c:305
Getting kernel stack for pid 5726
[<ffffffff86003ae4>] do_syscall_64+0x74/0x150
[<ffffffff8680009a>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<ffffffffffffffff>] 0xffffffffffffffff
======================
Getting user stack for pid 5727 awk
#0  0x00007feaca6fdc61 in read () from /lib64/libc.so.6
#1  0x000055a9cf35506e in ?? ()
#2  0x000055a9cf355968 in ?? ()
#3  0x000055a9cf342e6e in ?? ()
#4  0x000055a9cf3108a0 in ?? ()
#5  0x00007feaca634f8a in __libc_start_main () from /lib64/libc.so.6
#6  0x000055a9cf310bfa in ?? ()
Getting kernel stack for pid 5727
[<ffffffff8626241c>] pipe_wait+0x5c/0x90
[<ffffffff86262a9c>] pipe_read+0x20c/0x2e0
[<ffffffff86257cdb>] __vfs_read+0xdb/0x140
[<ffffffff86259069>] vfs_read+0x89/0x130
[<ffffffff8625a612>] SyS_read+0x42/0x90
[<ffffffff86003ae4>] do_syscall_64+0x74/0x150
[<ffffffff8680009a>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<ffffffffffffffff>] 0xffffffffffffffff
Finished: 1619451374.994192164
B4609009D:~ #

The script's run time should be kept short; it was less than a second here:

B4609009D:~ # grep -e Star -e Fini ra-debug_20210426_113614.log
Started: 1619451374.146883177
Finished: 1619451374.994192164
B4609009D:~ #

dclavender avatar Jun 16 '21 10:06 dclavender

Can one of the admins verify this patch?

knet-ci-bot avatar Jun 16 '21 10:06 knet-ci-bot

Hi,

Thanks for submitting this. I can see how it would be useful.

I'm conflicted about it, though. Calling (and blocking on) a script for every child process that times out (which will be more than just resource agents) has the potential to cause big problems, especially if the reason the child process is timing out is due to strain on system resources that executing a script will make worse.

I'd feel better if the option were a resource meta-attribute instead of a global environment variable, so that it would only apply to invocations of that resource agent (sort of like the trace_ra attribute). Someone could set it as a resource default if they really wanted it to apply to every agent, but at least it would still be limited to only resource agents.
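
As an illustration of what that might look like from the command line (the meta-attribute name "timeout_debug_script" below is purely hypothetical, standing in for whatever this option would end up being called, and "vip" is just an example resource):

# Hypothetical: enable the hook for one resource via a meta-attribute
crm_resource --resource vip --meta --set-parameter timeout_debug_script \
             --parameter-value /root/debugRA.sh
# Hypothetical: or set it as a resource default so it applies to every agent
crm_attribute --type rsc_defaults --name timeout_debug_script --update /root/debugRA.sh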

I'm also a little concerned that adding a hook like this into one spot will open the door to adding hooks anywhere, and without a plan for a consistent interface that can handle such growth, it could become unwieldy (and impossible to change afterward due to the need to keep backward compatibility). Documenting the feature as experimental and subject to change would give us some leeway to get experience with it before making it official.

@wenningerk @clumens @yan-gao @HideoYamauchi @oalbrigt -- any thoughts on the general idea here?

kgaillot avatar Jun 17 '21 22:06 kgaillot

Yep, generally a good idea I guess. But I'm sharing the concern that adding multiple similar interfaces here and there would be suboptimal, both regarding architecture and usability. Wasn't there a similar request quite recently? I have some déjà vu memory of putting a statement mentioning alerts there - maybe that's the alerts mentioned above. Anyway, I still think the alerts interface might be something to consider to prevent bloat in the interfaces offered to the user. That doesn't necessarily mean the async calling - for good reasons already stated by Ken - has to be done here as well. We could think of some kind of sync-flag or so. And it is probably key to have a possibility for filtering prior to forking a child.

wenningerk avatar Jun 18 '21 07:06 wenningerk

Hi

I'm conflicted about it, though. Calling (and blocking on) a script for every child process that times out (which will be more than just resource agents) has the potential to cause big problems, especially if the reason the child process is timing out is due to strain on system resources that executing a script will make worse.

Indeed, and this was my first thought when implementing the patch. However, I weighed that against the most recent real scenario I encountered: an enterprise customer that suffered an outage/failover due to an IPaddr2 RA timing out, and us not being able to give them a root cause or indeed any indication as to what/why... Telling them we couldn't do more until they saw the problem again was akin to telling Boeing we need a few more aircraft crashes to figure out what's going on... Seriously though, RA timeout cases are probably the most common pacemaker-related calls we see, and not being able to say anything about them, especially the one-offs, is a major headache for end users and support.

I like the granularity idea of the resource meta-attribute, although in practical terms I know what would happen: end users would set this on every resource and we would practically be back to the global variable case.

Getting back to the strain-on-the-system comment, I did add some timings to the script to see what impact it might have; the execution time was sub-second in my test environment. I also made a point in the sysconfig/pacemaker file of saying that you want to keep any such script lean and fast because of the serialisation impact it has. Having said that, I think further redesign could eradicate the serialisation effects. This really is a PoC (proof of concept) at this time.

Wasn't there a similar request quite recently? I have some déjà vu memory of putting a statement mentioning alerts there - maybe that's the alerts mentioned above. Anyway, I still think the alerts interface might be something to consider to prevent bloat in the interfaces offered to the user. That doesn't necessarily mean the async calling - for good reasons already stated by Ken - has to be done here as well. We could think of some kind of sync-flag or so. And it is probably key to have a possibility for filtering prior to forking a child.

Yes, Emil originally posted info about my patch with some comments to the mailing list:

https://lists.clusterlabs.org/pipermail/developers/2021-May/002368.html

You suggested doing the pull-request so that is what I did.

We tried the alerts method that you also suggested, and I managed to get a slightly modified version of my debugRA.sh script to run; however, as mentioned, by the time it runs pacemaker-execd has already killed the RA process hierarchy. So whatever approach we use, it must be deterministic and synchronised with pacemaker-execd's termination of tardy children in order to be useful.

I'm also a little concerned that adding a hook like this into one spot will open the door to adding hooks anywhere, and without a plan for a consistent interface that can handle such growth, it could become unwieldy (and impossible to change afterward due to the need to keep backward compatibility).

I totally get that. All I would ask you to consider, even if this current PoC is rejected, is that some serious consideration be given to product supportability. As I keep labouring, RA timeouts happen a lot; end users don't like them being unexplainable, and Support don't like having the processes killed before they can do anything to investigate them.

FYI, when troubleshooting this I considered several approaches; this was the only one involving modifications directly to pacemaker. The other two approaches were:

a. using a shell wrapper around pacemaker and intercepting appropriate functions

b. using kernel systemtap to intercept kill syscalls from pacemaker-execd

I discounted (a): it's so risky and invasive that you might as well try to get the feature into pacemaker. The (b) method is one I employ frequently for all kinds of different product problems, and whilst it could achieve most of my goals, it:

  • requires problem repeatability, i.e.: doesn't help with those one-off cases

  • needs to be re-worked for every OS distro/version

  • potentially needs to be left in place for long periods

  • can make end users nervous

As a consequence, I decided to see if I could get something into pacemaker directly that could be activated easily.

Anyway, I do appreciate your thoughts and ideas, not just on my patch proposal but on any way we can improve supportability in this area, even if it takes a different direction.

-- Regards

Darren

dclavender avatar Jun 18 '21 10:06 dclavender

https://lists.clusterlabs.org/pipermail/developers/2021-May/002368.html

You suggested doing the pull-request so that is what I did.

Thanks for helping my old brain here ;-) Thinking it over, I guess the addition of a sync-flag might not be that intrusive. Alerts are already executed in the execd context. Execd is capable of executing multiple scripts simultaneously. So we could queue the alert, mark in the alert that we need to be triggered when the script process exits (usually alerts are fire & forget), stall further execution of the timeout handling, and just have it resumed when the trigger comes.

wenningerk avatar Jun 18 '21 10:06 wenningerk

Execd is capable of executing multiple scripts simultaneously. So we could queue the alert, mark in the alert that we need to be triggered when the script process exits (usually alerts are fire & forget), stall further execution of the timeout handling, and just have it resumed when the trigger comes.

Just for my benefit, could you elaborate a bit more on the timings and the relationship between the alerts (plus executed scripts) and the RAs? This is specifically within the context of your sync-flag proposal and how we ensure that the alert script always runs in a manner that allows it to capture the data of interest, i.e.: information pertaining to the problem RA hierarchy.

Also, is there a method of communicating the RA pgid to the alert when the alert script triggers? In my current patch the pgid of the RA of interest is passed as an argument to the debugging script, so we know exactly which process hierarchy we need to investigate. This is harder with alerts if we're not given that information, because we then somehow have to figure it out, or else go and look at all children of pacemaker-execd, which seems wasteful of time and resources.

Could it be passed as an env-var (I don't see any kind of pid or pgid mentioned here):

https://clusterlabs.org/pacemaker/doc/deprecated/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_writing_an_alert_agent.html
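
For reference, a minimal alert agent today only sees variables along these lines (nothing carrying a pid/pgid, which is exactly the gap being discussed; the commented-out CRM_alert_pgid below is purely hypothetical):

#!/bin/sh
# Sketch of an alert agent using existing CRM_alert_* variables from the linked docs
logger -t ra-alert "kind=${CRM_alert_kind} rsc=${CRM_alert_rsc} task=${CRM_alert_task} rc=${CRM_alert_rc} desc=${CRM_alert_desc}"
# logger -t ra-alert "pgid=${CRM_alert_pgid}"   # hypothetical variable, does not exist today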

dclavender avatar Jun 18 '21 11:06 dclavender

I like the granularity idea of the resource meta-attribute, although in practical terms I know what would happen: end users would set this on every resource and we would practically be back to the global variable case.

Meta-attributes can be specified globally in the rsc_defaults section :-)

We tried the alerts method that you also suggested, and I managed to get a slightly modified version of my debugRA.sh script to run; however, as mentioned, by the time it runs pacemaker-execd has already killed the RA process hierarchy. So whatever approach we use, it must be deterministic and synchronised with pacemaker-execd's termination of tardy children in order to be useful.

I assume it could be achieved with a registered callback upon timeout, before the killing of the process group, but that would likely need extending the relevant APIs ...

gao-yan avatar Jun 18 '21 11:06 gao-yan

I'd feel better if the option were a resource meta-attribute instead of a global environment variable, so that it would only apply to invocations of that resource agent (sort of like the trace_ra attribute). Someone could set it as a resource default if they really wanted it to apply to every agent, but at least it would still be limited to only resource agents.

Or it could be an op parameter, which could be specified globally in op_defaults. The disadvantage is that it could be somewhat tedious to specify for all the ops of a single resource, though.

gao-yan avatar Jun 18 '21 12:06 gao-yan

Just for my benefit, could you elaborate a bit more on the timings and the relationship between the alerts (plus executed scripts) and the RAs? This is specifically within the context of your sync-flag proposal and how we ensure that the alert script always runs in a manner that allows it to capture the data of interest, i.e.: information pertaining to the problem RA hierarchy.

All data passed to alerts is recorded at the time, and in the context, of the event that the alert is supposed to report on. This data is then passed to execd so the alert agent can be executed in that context. So that alerts are executed as timely as possible and don't throttle execd operation, they are fired off one after the other without waiting for the previous one to finish. This execution pattern doesn't even guarantee that the order of execution reflects the order of occurrence, which is why we record high-res timestamps when putting together the data for an alert agent. My crude idea behind the sync-flag - I didn't look into the details - was that the timer which up to now would trigger killing the RA directly would instead queue execution of an alert agent (if one is registered), and if it has the sync-flag set, then instead of doing the termination directly it would set up a trigger for the final termination (fired when the alert agent returns). As RA actions are executed within the execd context, a simple trigger should be sufficient. If we want to make the sync feature available for other alerts as well, we'd have to use some kind of IPC.

Hope that clarifies my idea a bit. As said, I haven't checked the implementation details that might make things complicated or even not worth the hassle.

Could it be passed as an env-var (I don't see any kind of pid or pgid mentioned here):

Yep, that would have been my idea. Of course it's not there yet, but I guess it would be of generic interest for alert agents.

wenningerk avatar Jun 18 '21 12:06 wenningerk

I actually don't think we should use alerts for this. It's not a notification that something happened; it's a pre-event hook to collect statistics, which I think is logically separate. The sync vs. async question is a result of that logical difference.

I agree that we currently lack visibility into what happens just before a problem. To some extent, external tools like sar or collectd are the right solution (you just have to match up timestamps). However for agent timeouts, it can still be difficult to figure out what's related.

Perhaps something more targeted might be appropriate. We could have an option specifically for collecting system statistics at the point of a timeout, and build it in rather than call out to a script. That would take away a lot of flexibility, but it would keep the impact to a known level. We'd just need to figure out what would be appropriate to grab at that moment.

As an interface I'm thinking something that says "collect these statistics at these points", with RA timeout being the only supported point initially. I'm thinking we should append the results to a binary file, so in the future we can add some tools for numeric analysis (or third-party tools could do it). We'd probably want a command-line tool to dump the contents in a readable way. We'd also need some sort of log rotation for it.

That's a lot bigger project than what's here, but we could consider it experimental and build it out in steps.

Alternatively, if there's some third-party system statistics collection tool that has a well-defined interface, we could build support for that, and trigger it to collect when we need it. The sync issue might be the only question there.

kgaillot avatar Jun 18 '21 15:06 kgaillot

Hi Ken, Hi All,

For now, I have no comments to add. We need to discuss a little more.

Our Japanese community members have previously tackled a similar timeout issue on corosync. I think there are comments and requests from the members of the Japanese community.

Best Regards, Hideo Yamauchi.

HideoYamauchi avatar Jun 18 '21 16:06 HideoYamauchi

Ken et al,

I agree that we currently lack visibility into what happens just before a problem. To some extent, external tools like sar or collectd are the right solution (you just have to match up timestamps). However for agent timeouts, it can still be difficult to figure out what's related.

Unfortunately, performance tools like sar/collectl will only tell you whether some process in the RA script is busy; that may be useful, but it doesn't provide any indication of what the process is busy doing. For that you want to start with those stack traces. If it's blocked, a single stack trace will do; if it is busy, then you want samples or, better yet, trace data. Quite often just getting the process stack traces will provide sufficient clues in the first iteration to ascertain what the problem is, hence the design of the debugRA.sh script.

Perhaps something more targeted might be appropriate. We could have an option specifically for collecting system statistics at the point of a timeout, and build it in rather than call out to a script. That would take away a lot of flexibility, but it would keep the impact to a known level. We'd just need to figure out what would be appropriate to grab at that moment.

I actually had this exact discussion with somebody and I strongly rejected the notion in favour of the call out, my reasons being:

  • with regards to doing this in pacemaker, the biggest problem with that is flexibility. With an external call-out you can easily adjust and adapt the information you gather between troubleshooting iterations, and that could be done in a script, an executable, a SystemTap script, etc. If this is built into pacemaker it loses that flexibility and, importantly, any rapid method of adjustment. Further troubleshooting would entail further changes to the pacemaker code, which means more coding, rebuilding, pacemaker testing and, potentially for some customers, further validation. This in turn entails more risk: the risk of introducing bugs leading to application failure or other erroneous behaviour.

  • it makes pacemaker less implementation-agnostic, as capturing stacks and doing that kind of work is very implementation-dependent. Not to mention that pacemaker's role is cluster management; it should provide us with the framework to enable supportability but not be responsible for performing the support function itself.

  • finally, there is the question of security. Although pacemaker is the parent of the RAs, from a compartmentalisation perspective I don't like giving it the kind of access we might have to grant in more secure environments to do things that might raise red flags with auditors. A small standalone script or executable can be more easily secured/controlled (e.g.: AppArmor/SELinux), and being a script, its nature can be made more transparent with respect to its function.

Concerning the impact of an external call-out, I think we can show due diligence by commenting the config file and/or the example scripts with appropriate warnings about what you should/should not do and how long you should spend collecting data. The biggest problem that I see with my current PoC is the serialisation. If that impediment is removed then it looks much lower risk to pacemaker-execd's functionality. Perhaps we could also make such functionality only apply to RAs and not anything else that pacemaker-execd might fire off...??

-- Regards

Darren

dclavender avatar Jun 18 '21 16:06 dclavender

I have seen many timeout reports on systems where antivirus software is installed. The culprit for the timeout is the antivirus software's on-access scanning, which hooks e.g. open()/execve() operations using a closed-source kernel module.

The only hints that suggest the possibility of being unexpectedly disturbed by an on-access scanning kernel module are messages like https://access.redhat.com/solutions/3227401 and https://success.trendmicro.com/solution/000262458 .

What is important for me when investigating this problem is to periodically collect debug information (such as /proc/$pid/stack of RA processes) before the 60-second timeout expires. That is, run the collector process asynchronously rather than synchronously upon timeout.

To minimize the CPU/memory impact of collecting debug information, and to minimize the possibility of being unexpectedly disturbed by the on-access scanning kernel module, the collector (which sends the stack traces of processes to syslog when 'descendant processes of the pacemaker-execd process invoked by the pacemakerd process' have been alive for more than a threshold duration) is written in C.
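
Just to illustrate the polling idea (the real collector described above is written in C; this crude shell sketch only inspects direct children of pacemaker-execd rather than all descendants, and the 5-second poll interval and 30-second threshold are arbitrary):

# Rough illustration of asynchronous periodic collection, to be run as root
THRESHOLD=30
EXECD_PID=$(pidof pacemaker-execd)
while sleep 5; do
    for pid in $(ps -o pid= --ppid "$EXECD_PID"); do
        elapsed=$(ps -o etimes= -p "$pid" | tr -d ' ')
        if [ "${elapsed:-0}" -ge "$THRESHOLD" ]; then
            logger -t ra-collector "pid $pid ($(ps -o comm= -p "$pid")) alive for ${elapsed}s"
            logger -t ra-collector -f /proc/"$pid"/stack 2>/dev/null
        fi
    done
done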

What I wish Pacemaker would do to help minimize the CPU/memory impact of collecting debug information is to run all Pacemaker-related processes in some cgroup, so that the collector only needs to scan the pids listed in e.g. /sys/fs/cgroup/pids/pacemaker/tasks instead of scanning all of the /proc/$pid/ directories.

Can we support running only Pacemaker-related processes in such a cgroup, given that processes started/stopped by Pacemaker (e.g. PostgreSQL) must run in their intended cgroups?

KumanekoSakura avatar Jun 22 '21 03:06 KumanekoSakura

What is important for me when investigating this problem is to periodically collect debug information (such as /proc/$pid/stack of RA processes) before the 60-second timeout expires. That is, run the collector process asynchronously rather than synchronously upon timeout.

That does sound more useful to me.

To minimize the CPU/memory impact of collecting debug information, and to minimize the possibility of being unexpectedly disturbed by the on-access scanning kernel module, the collector (which sends the stack traces of processes to syslog when 'descendant processes of the pacemaker-execd process invoked by the pacemakerd process' have been alive for more than a threshold duration) is written in C.

That also makes sense.

It sounds to me like the collection interval should be configurable as well, and potentially limited to particular resources and/or actions. That points toward a CIB interface instead of an environment variable, something along the lines of:

<primitive id="rsc1" class="ocf" provider="pacemaker" type="Dummy">
    <operations>
        <op id="rsc1-monitor-60s" name="monitor" interval="60s" timeout="31s"
               stats_agent="/path/to/binary" stats_interval="10s"/>
    </operations>
</primitive>

The values could be set in op_defaults to apply to all operations (or using rules, to specific resources and/or operations).
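
For example, the op_defaults variant might look something like this (again, stats_agent and stats_interval are only names proposed here, not existing options, and the exact placement would need to be worked out):

<op_defaults>
    <meta_attributes id="op-defaults">
        <nvpair id="op-defaults-stats-agent" name="stats_agent" value="/path/to/binary"/>
        <nvpair id="op-defaults-stats-interval" name="stats_interval" value="10s"/>
    </meta_attributes>
</op_defaults>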

We'd also need a timeout for the stats collection, after which we kill it. We could reuse the stats interval for this.

We should consider what information might be useful inside a stats agent and what the interface should be (environment variables or command-line arguments, and potentially some meaning to exit codes).

A variation on this approach might be to define a new OCF resource agent action to do the collection (instead of defining the binary, the user would configure only the interval, and pacemaker would call the normal resource agent with the new action name). The downsides are (1) it's less easy to customize locally, (2) it could only be used with OCF agents, and (3) OCF agents are typically interpreted scripts. The upsides are (1) the interesting information to collect is likely agent-specific (e.g. IPaddr2 vs. httpd might want very different and specific information), and (2) it's easier for the end user. Appealing, but maybe not practical.

Any CIB-based approach would be significantly more complicated than this PR, though.

What I wish Pacemaker would do to help minimize the CPU/memory impact of collecting debug information is to run all Pacemaker-related processes in some cgroup, so that the collector only needs to scan the pids listed in e.g. /sys/fs/cgroup/pids/pacemaker/tasks instead of scanning all of the /proc/$pid/ directories.

Can we support running only Pacemaker-related processes in such a cgroup, given that processes started/stopped by Pacemaker (e.g. PostgreSQL) must run in their intended cgroups?

That's an interesting idea that will require some investigation.

kgaillot avatar Jun 22 '21 18:06 kgaillot