pacemaker icon indicating copy to clipboard operation
pacemaker copied to clipboard

Fix: tools: crm_mon --daemonize should update when disconnected

Open nrwahl2 opened this issue 2 years ago • 6 comments

With crm_mon --daemonize, the output will continue to show the last known status after the cluster is stopped on the local node. It should go back to something like "Cluster is not available".

Likewise, messages like "Reconnecting..." should not go to the daemonized output. The output file (or external handler) should receive only the cluster status. So we print those messages only if output_format is mon_output_console (or if we're in one-shot mode, where applicable). Normally those messages don't get printed to the daemonized output anyway because we don't flush the buffer, but theoretically they could get printed if the buffer fills up.

It's questionable whether we want to print "Cluster is not available" when we're in interactive mode or only in daemonized mode. For now, we'll do daemonized-only. Interactive (console) mode already gets the "Connection to cluster daemons terminated" message, which is quickly replaced by "Reconnecting...".

External agents are notified via traps rather than via output, so we can ignore them.

Closes T15

nrwahl2 avatar Sep 15 '22 04:09 nrwahl2

The first commit may not be comprehensive -- for example, "waiting for start" type messages may still appear in daemonized text output during startup. It's an improvement and should take care of most situations after we're already up and running.

nrwahl2 avatar Sep 15 '22 04:09 nrwahl2

pushed from the wrong local branch...

nrwahl2 avatar Oct 06 '22 08:10 nrwahl2

Updated. This gets the job done for daemonized mode without switching on output formats and such, and without negatively impacting console or one-shot mode.

It does not print a string representation of the pacemakerd status (e.g., "Pacemaker daemons shutting down") if the pacemakerd state is noteworthy when run in one-shot mode. That would require either merging pcmk_status.c:pacemakerd_status() and crm_mon.c:pacemakerd_status() into one function that can fulfill both use cases' requirements, or adding a fair amount of code to pcmk_status.c:pacemakerd_status() and its callers.

I think it would be a good idea to merge them into library functions at some point -- and likewise for other similar static functions. I started working on it and named it pcmk__pacemakerd_status()... and then it failed to compile because there's already a function with that name in pcmk_cluster_queries.c. That function and its event callback do some of the same things but were built with a different use case in mind. Getting everything merged cleanly shouldn't be a huge undertaking, but it's not trivial and so I don't want to let it hold up this PR.

I also put a couple of formatter functions for a new message in crm_mon.c and registered them directly. It would make intuitive sense to have crm_mon_register_messages() register them. That function currently is defined in crm_mon_curses.c and registers formatters specifically for curses. If we want, we can change that function name to "crm_mon_register_curses_messages" and then define a "crm_mon_register_messages" in crm_mon.c.

nrwahl2 avatar Oct 06 '22 09:10 nrwahl2

Depending on how much effort it is, before having this merged, I may try to make the console and daemonized mode share more code with one-shot mode since we're doing that work in #2902. That would entail, at minimum, using pcmk__pacemakerd_status(). It could also involve sharing a cib_connect() and fencing_connect() implementation and maybe incorporating pcmk__status()... I haven't given it any detailed thought since starting on #2902.

nrwahl2 avatar Oct 11 '22 06:10 nrwahl2

Updated minimally to address review. Most of the massive Compare output is from rebasing on current main and resolving conflicts. There's still room for improvement in both the output and the implementation, but this gets the job done.

I'll be filing another PR with some other refactors as a start

nrwahl2 avatar Oct 17 '22 06:10 nrwahl2

The CI failures were RPM issues on three hosts. Two of them were OpenSUSE hosts that timed out when trying to access the libqb100 RPM. The other was CentOS 9 s390x: " - nothing provides libqb(s390-64) = 2.0.6-1.el9.next needed by libqb-devel-2.0.6-1.el9.next.s390x"

nrwahl2 avatar Oct 17 '22 08:10 nrwahl2

Totally replaced commits.

Still working out some schema issues (edit: fixed in #2919)

nrwahl2 avatar Nov 03 '22 22:11 nrwahl2

Updated to address review, which was more involved than expected (see https://github.com/ClusterLabs/pacemaker/pull/2868#discussion_r1014536512). This is already way too big, so if it gets any bigger, it'll be split up.

  • Updated cib__signon_query() to return pcmk_rc_no_input if (rc == pcmk_rc_ok) && (*cib_object == NULL).
  • Added a commit at the beginning to fix the pcmkd state display for XML format (should use the short display instead of friendly).
  • Added two commits to include pacemakerd status in the cluster-summary message for native CIB connections. It's excluded by default for file and remote CIB connections (if you include it, it will always say "Unknown"). It's also specifically excluded when pcmk_simulate calls cluster-summary.

Not addressed yet:

  • Newline before "Connection to the cluster lost". Need to figure out whether to leave that as-it-was, add the newline, or modify the curses error/info functions to start at the beginning of the line.

nrwahl2 avatar Nov 05 '22 00:11 nrwahl2

Updated to:

  • remove the newline from the "Connection to cluster lost" transient message
  • move the pacemakerd state into the stack section:
    <stack type="corosync" pacemakerd-state="running"/>
    <stack type="corosync" pacemakerd-state="shutting_down"/>

  * Stack: corosync (Pacemaker is running)
  * Stack: corosync (Pacemaker daemons are shutting down)
  • Remove the pacemakerd-health messages from crm_mon.c (the crm-mon-disconnected message covers this already for the non-one-shot case)

nrwahl2 avatar Nov 07 '22 20:11 nrwahl2