opennms icon indicating copy to clipboard operation
opennms copied to clipboard

Add startup/shutdown progress bar and improve communication of startup errors to users

Open deejgregor opened this issue 2 years ago • 5 comments

What it looks like on success

$ bin/opennms restart
Stopping OpenNMS: 100% [=============================] 26/26 (0:00:07 / 0:00:00) All Passes Complete
Starting OpenNMS: 100% [=============================] 52/52 (0:00:32 / 0:00:00) All Passes Complete
$

output

What it looks like on failure

(I added extra bogus text to the class name for eventd in etc/service-configuration.xml)

$ bin/opennms restart
Stopping OpenNMS: Trying to stop OpenNMS but it's already stopped.
Starting OpenNMS:   3% [=>                                 ]  2/52 (0:00:01 / 0:00:29) Pass 0: TestLoadLibraries
An error occurred while attempting to start the "OpenNMS:Name=Eventd" service (class org.opennms.netmgt.eventd.jmx.Eventdbogus).  Shutting down and exiting.
java.lang.ClassNotFoundException: org.opennms.netmgt.eventd.jmx.Eventdbogus
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
	at java.base/java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:904)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:315)
	at org.opennms.netmgt.vmmgr.Invoker.instantiateClasses(Invoker.java:159)
	at org.opennms.netmgt.vmmgr.Starter.start(Starter.java:156)
	at org.opennms.netmgt.vmmgr.Starter.startDaemon(Starter.java:96)
	at org.opennms.netmgt.vmmgr.Controller.start(Controller.java:173)
	at org.opennms.netmgt.vmmgr.Controller.main(Controller.java:150)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.opennms.bootstrap.Bootstrap$4.run(Bootstrap.java:531)
	at java.base/java.lang.Thread.run(Thread.java:829)

Stopping OpenNMS: 100% [=========================================] 26/26 (0:00:00 / 0:00:07) All Passes Complete
Shutdown completed after receiving on error on startup. See error details above.
Started OpenNMS, but it stopped running: see details above or /Users/dgregor/git/opennms/target/opennms-31.0.0-SNAPSHOT/logs/output.log
$

systemd

Unfortunately, we won't see the updates directly out of sudo systemctl restart opennms. To see status during OpenNMS restarts use journalctl -f -u opennms to monitor separately (this is what is used for the excerpts below) and/or something like watch -n 1 systemctl status opennms (I personally like seeing the systemd view with the latter).

Oct 23 00:42:20 dgregor-vm-perf systemd[1]: Starting OpenNMS server...
Oct 23 00:42:22 dgregor-vm-perf opennms[33047]: Starting OpenNMS: running ulimit -a
Oct 23 00:42:22 dgregor-vm-perf opennms[33047]: core file size          (blocks, -c) 0
...
Oct 23 00:42:22 dgregor-vm-perf opennms[33047]: file locks                      (-x) unlimited
Oct 23 00:42:22 dgregor-vm-perf systemd[1]: opennms.service: Can't open PID file /run/opennms/opennms.pid (yet?) after start: Operation not permitted
Oct 23 00:42:22 dgregor-vm-perf systemd[1]: opennms.service: Supervising process 33601 which is not our child. We'll most likely not notice when it exits.
Oct 23 00:42:22 dgregor-vm-perf systemd[1]: Started OpenNMS server.
Oct 23 00:42:23 dgregor-vm-perf opennms[33047]: (not waiting for startup)
Oct 23 00:42:24 dgregor-vm-perf opennms[33600]: Starting OpenNMS:   0% [                                                ]  0/52 (0:00:42 / 0:00:42)
Oct 23 00:42:24 dgregor-vm-perf opennms[33600]: Starting OpenNMS:   1% [                       ]  1/52 (0:00:42 / 0:00:42) Pass 0: TestLoadLibraries
...
Oct 23 00:43:02 dgregor-vm-perf opennms[33600]: Starting OpenNMS: 100% [=============================] 52/52 (0:01:21 / 0:00:03) All Passes Complete

On failure:

...
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]: Starting OpenNMS:   3% [                       ]  2/52 (0:00:42 / 0:00:42) Pass 0: TestLoadLibraries
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]: An error occurred while attempting to start the "OpenNMS:Name=Eventd" service (class org.opennms.netmgt.eventd.jmx
.Eventdblarg).  Shutting down and exiting.
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]: java.lang.ClassNotFoundException: org.opennms.netmgt.eventd.jmx.Eventdblarg
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]:         at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]:         at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]:         at java.base/java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:904)
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]:         at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]:         at java.base/java.lang.Class.forName0(Native Method)
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]:         at java.base/java.lang.Class.forName(Class.java:315)
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]:         at org.opennms.netmgt.vmmgr.Invoker.instantiateClasses(Invoker.java:160)
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]:         at org.opennms.netmgt.vmmgr.Starter.start(Starter.java:156)
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]:         at org.opennms.netmgt.vmmgr.Starter.startDaemon(Starter.java:96)
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]:         at org.opennms.netmgt.vmmgr.Controller.start(Controller.java:173)
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]:         at org.opennms.netmgt.vmmgr.Controller.main(Controller.java:150)
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]:         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]:         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]:         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43
)
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]:         at java.base/java.lang.reflect.Method.invoke(Method.java:566)
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]:         at org.opennms.bootstrap.Bootstrap$4.run(Bootstrap.java:531)
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]:         at java.base/java.lang.Thread.run(Thread.java:829)
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]: Stopping OpenNMS:   0% [                                                ]  0/26 (0:01:09 / 0:01:09)
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]: Stopping OpenNMS:   3% [=                                  ]  1/26 (0:01:09 / 0:01:09) Pass 0: Trapd
...
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]: Stopping OpenNMS: 100% [=============================] 26/26 (0:01:09 / 0:01:09) All Passes Complete
Oct 22 02:05:26 dgregor-vm-perf opennms[127533]: Shutdown completed after receiving on error on startup. See error details above.
Oct 22 02:05:28 dgregor-vm-perf opennms[127711]: Stopping OpenNMS:
Oct 22 02:05:28 dgregor-vm-perf opennms[127967]: Trying to stop OpenNMS but it's already stopped.
Oct 22 02:05:28 dgregor-vm-perf opennms[127711]: stopped
Oct 22 02:05:28 dgregor-vm-perf systemd[1]: opennms.service: Killing process 127534 (n/a) with signal SIGKILL.
Oct 22 02:05:28 dgregor-vm-perf systemd[1]: opennms.service: Killing process 127534 (n/a) with signal SIGKILL.
Oct 22 02:05:28 dgregor-vm-perf systemd[1]: opennms.service: Succeeded.

New log files

  • manager.json contains elapsed times from previous stop/start runs; used for the ETA on the progress bar. Blame @mhuot for suggesting adding this (it was actually quite easy).
  • progressbar.log contains the progress bar output. This is just tailed in the background in the opennms script.
$ tail -4 logs/manager.json
{"time":"2022-09-24T18:48:21.181Z","elapsed":"PT7.502S","type":"STOP"}
{"time":"2022-09-24T18:48:58.978Z","elapsed":"PT32.134S","type":"START"}
{"time":"2022-09-24T18:49:26.387Z","elapsed":"PT7.53S","type":"STOP"}
{"time":"2022-09-24T18:50:03.164Z","elapsed":"PT31.122S","type":"START"}
$ tail logs/progressbar.log
Start 100% [=======================] 52/52 (0:00:31 / 0:00:00) PerspectivePoller
Stop 100% [==================================] 26/26 (0:00:07 / 0:00:00) Manager
Start 100% [=======================] 52/52 (0:00:32 / 0:00:00) PerspectivePoller
Stop 100% [==================================] 26/26 (0:00:07 / 0:00:00) Manager
Start 100% [=======================] 52/52 (0:00:30 / 0:00:01) PerspectivePoller
$

Files changed in etc

  • log4j2.xml -- two things:
    • Change Invoker and Manager log level to DEBUG but set the log level for the console appender to INFO. Functionally, this keeps the same level of information on the console as before, but logs these classes at DEBUG level to the log files. This way, messages for the Invoker and Manager get logged to the log file at the same level as all of our other logs (there are occasionally some useful bits in there).
    • Add new logging routes for the manager JSON file (which records startup/shutdown times) and the progress bar log file. The progress bar support will not work without these changes.
  • opennms.service -- change to ExecStop configuration option to add -s option when calling opennms stop. This is a nice-to-have to not have duplicate logs in systemd on shutdown, but is not needed, and it is backward-compatible.

Because of this, it would be preferable (but I don't know how preferable) to get this change in at the initial release on a foundation branch.

New opennms.conf option

PROGRESS_BAR defaults to 1 to enable this feature. Add PROGRESS_BAR=0 to opennms.conf to disable and get the old functionality.

New libraries

  • me.ctongfei:progressbar -- https://github.com/ctongfei/progressbar/
  • org.jline:jline -- https://github.com/jline/jline3 (transitive dependency of progressbar)

New system properties

  • opennms.name -- the value that gets substituted in at build time from ${install.package.description}. Defaults to OpenNMS.
  • terminal.columns -- the terminal width at start time (if it could be determined). Defaults to 100. Only set if tput determines that there is a tty with a TERM set. The existence of this property is also used to decide if the progress bar will be in "interactive" mode, which case the output is tweaked a bit (mostly to support systemd, which will be non-interactive).

Hack (no longer needed; removed)

progressbar 0.9.5 has been released the our code updated, so this hack is gone.

~~The existing ETA calculation in the progress bar library doesn't work well for our non-linear daemon startup times. The library very recently added a way to for callers to provide a function to calculate ETA through ProgressBarBuilder.setETAFunction, however the method signature references a package private class, so it can't be externally called. I hacked around it by adding a small helper class in the me.tongfei.progressbar package. Once https://github.com/ctongfei/progressbar/tree/0.9.5 is merged (https://github.com/ctongfei/progressbar/pull/146 has already been merged to that branch, but the approach might be tweaked a bit, which is fine), this hack can go away.~~

To-Do

  • [x] Is this a good way to do this? Any suggestions on approach?
  • [x] Probably some code cleanup; I kinda just go this working and got the high-level structure decent-ish.
  • [x] Cleanup commit messages, maybe
  • [x] There is one open XXX item in the opennms startup script around systemd that I could use someone more familiar with systemd to take a look at.

All Contributors

Contribution Checklist

  • Please make an issue in the OpenNMS issue tracker if there isn't one already.
    Once there is an issue, please:
    1. update the title of this PR to be in the format: ${JIRA-ISSUE-NUMBER}: subject of pull request
    2. update the JIRA link at the bottom of this comment to refer to the real issue number
    3. prefix your commit messages with the issue number, if possible
    4. once you've created this PR, please link to it in a comment in the JIRA issue Don't worry if this sounds like a lot, we can help you get things set up properly.
  • If this code is likely to affect the UI, did you name your branch with -smoke in it to trigger smoke tests?
  • If this is a new or updated feature, is there documentation for the new behavior?
  • If this is new code, are there unit and/or integration tests?
  • If this PR targets a foundation-* branch, does it try to avoid changing files in $OPENNMS_HOME/etc/?

What's Next?

A PR should be assigned at least 2 reviewers. If you know that someone would be a good person to review your code, feel free to add them.

If you need help making additions or changes to the documentation related to your changes, please let us know.

In any case, if anything is unclear or you want help getting your PR ready for merge, please don't hesitate to say something in the comments here, or in the #opennms-development chat channel.

Once reviewer(s) accept the PR and the branch passes continuous integration, the PR is eligible for merge.

At that time, if you have commit access (are an OpenNMS Group employee or a member of the OGP) you are welcome to merge the PR when you're ready. Otherwise, a reviewer can merge it for you.

Thanks for taking time to contribute!

External References

  • JIRA (Issue Tracker): https://opennms.atlassian.net/browse/NMS-15116

deejgregor avatar Sep 24 '22 18:09 deejgregor

This generally looks OK to me; what does it look like in systemctl? Do we need to make init calls to startup disable the progress bar?

RangerRick avatar Oct 06 '22 14:10 RangerRick

This generally looks OK to me; what does it look like in systemctl? Do we need to make init calls to startup disable the progress bar?

This. Generally, users will never run bin/opennms start by hand, they systemctl start opennms.

dino2gnt avatar Oct 06 '22 15:10 dino2gnt

This generally looks OK to me; what does it look like in systemctl? Do we need to make init calls to startup disable the progress bar?

That was my last question. :) I don't have any Linux systems, but I will spinup a test on an ubuntu container when I have enough spare cycles and see how things look and tweak as needed. I was almost there last week but ran out of time while chasing down dependencies for download-artifacts.pl.

deejgregor avatar Oct 11 '22 23:10 deejgregor

This. Generally, users will never run bin/opennms start by hand, they systemctl start opennms.

@dino2gnt I did some testing on Ubuntu with systemd and made some improvements. I added a systemd section to the description of this PR with some samples. It doesn't make any difference if the user just runs sudo systemctl restart opennms, but it works out well if they run journalctl -f -u opennms to tail the logs or something like watch -n 1 systemctl status opennms.

And I learned about systemd in the process. :)

deejgregor avatar Oct 26 '22 05:10 deejgregor

@RangerRick I have pointed this progress bar PR back at develop, all checks passed, and it's ready to review. Do you want to add anyone else that should take a look? Also, since this is pointing at develop, there is no rush--it can wait a bit while release-31.x work is happening.

@dino2gnt If you want to review how this shows up in systemd, there is a systemd section in the description of the PR.

deejgregor avatar Dec 07 '22 19:12 deejgregor