storm
storm copied to clipboard
STORM-3173: flush metrics to ScheduledReporter on shutdown
https://issues.apache.org/jira/browse/STORM-3173
https://issues.apache.org/jira/browse/STORM-3128 This bug in test appeared again and broke the test in this PR but I don't actually know the cause of it.
Build logs report test crash [INFO] Running org.apache.storm.scheduler.resource.TestResourceAwareScheduler, but it's passing on my local machine. Don't know what's happening. It'll be great if you guys can offer some insight. @Ethanlm @srdo @HeartSaVioR
The exit code is 20. Try looking for uses of Utils.exitProcess
with 20
as the argument. If you change the values to be unique and rerun the tests, you should be able to nail down where the exit is happening.
I managed to fix the original timeout error by wrapping around a nimbus gauge with exception handling code. However the PR still failed the storm-server test due to error in starting fork. I don't really know what caused this. @srdo Have you seen anything like this before?
Additionally, I think our tests have some flaws in manipulating zookeeper as I constantly see a component tries to connect to Zookeeper that is not set up in the test, causing timeout warning in https://issues.apache.org/jira/browse/STORM-3128
No, but like I said, I suspect that the exit code is being set by a call to Utils.exitProcess
. If you try changing some of the exit codes to be unique, it's likely you'll be able to tell which call to Utils.exitProcess
is responsible for killing the JVM, which could help you tell what's happening.
The lastest failure didn't provide an error code though.
Right, sorry, didn't recheck the logs.
Unless it turns out to be easy to fix the test errors, I'd like to propose not merging this and fixing this issue as part of making the metrics registry non-static. I added reporter shutdowns to all the daemons on the non-static-metrics branch, and it works with no need for bandaid fixes https://travis-ci.org/srdo/storm/builds/413283316.
Interesting. Does your implementation also guarantee flushing all metrics upon exit?
It should. Take a look at https://github.com/srdo/storm/compare/892179527b5a...0916fae11cb2#diff-9c4945e8e924565ca1f071435f4711e9. If you compare the daemon classes (Nimbus, Supervisor etc.) there should be a call to stopMetricsReporters in all of them, that should stop the reporters on shutdown.
I can't see changes in your link, so I looked into your latest change in branch non-static-metrics
. Please correct me if I interpret your code wrong.
Your implementation didn't actually flush the metrics for ScheduledReporter on exit. You want to add reporter.report()
in ConsolePreparableReporter or CsvPreparableReporter and test it out. See my commit STORM-3173: Addressing nits
Okay that alerted me one thing. What's our config file for the testing? If it doesn't use ScheduledReporter for testing, this PR shouldn't make a matter at all.
Also, I went back and dug into the error code you suggested. It turned out to be caused by heartbeatTimer (instance of StormTimer) in Supervisor in AsyncLocalizerTest.testKeyNotFoundException. But I don't see where Supervisor is created. Super weird.
2018-08-07 16:38:11.217 [heartbeatTimer] ERROR org.apache.storm.daemon.supervisor.DefaultUncaughtExceptionHandler - Error when processing event on thread Thread[heartbeatTimer ,10,main]
...
2018-08-07 16:38:11.218 [heartbeatTimer] ERROR org.apache.storm.utils.Utils - val 202 2018-08-07 16:38:11.218 [heartbeatTimer] ERROR org.apache.storm.utils.Utils - Halting process: Error when processing an event java.lang.RuntimeException: Halting process: Error when processing an event at org.apache.storm.utils.Utils.exitProcess(Utils.java:474) [storm-client-2.0.0-SNAPSHOT.jar:?] at org.apache.storm.daemon.supervisor.DefaultUncaughtExceptionHandler.uncaughtException(DefaultUncaughtExceptionHandler.java:25) [classes/:?] at org.apache.storm.StormTimer$StormTimerTask.run(StormTimer.java:253) [storm-client-2.0.0-SNAPSHOT.jar:?]
Thanks again for helping me out of this issue though. Hope this don't create too much trouble for you.
I've added calls to report to the branch, the tests are running at https://travis-ci.org/srdo/storm/builds/413332618.
Unless the test specifies another configuration file, I would expect us to be using https://github.com/apache/storm/blob/master/conf/defaults.yaml.
I also don't understand why heartbeatTimer would still be running in AsyncLocalizerTest. JUnit doesn't shut down the JVM between tests, so it would make sense that a thread could leak from another test into AsyncLocalizerTest, but it would require someone opening a Supervisor in a test somewhere and forgetting to close it.
Happy to help out.
I am also getting org.apache.maven.surefire.booter.SurefireBooterForkException: Error occurred in starting fork, check output in log
on my VM. And the tests passed without this patch. Still need to investigate the issue
Tested it manually on a single-node cluster. nimbus:num-shutdown-calls
was flushed now with this patch. Still need to figure out travis build failure before merging this in.
Before spending too long on it, consider if we should fix this issue and https://github.com/apache/storm/pull/2714 by making the metrics registry non-static instead (https://github.com/apache/storm/pull/2783). I think it will be much easier to do.
That's right. Will take a look at non-static registry after #2764 and #2764 get merged in
Once the two PRs with new metrics are merged, I'll try to update the non-static branch and get a PR opened. It might take me a little while.
sure. Take your time please.
what was the status of this? I think work @srdo was doing addressed STORM-3173 possibly? Was that change committed?
The change is here https://github.com/apache/storm/pull/2805 and hasn't been merged yet.
#2764 also depends on resolution of this issue. Please consider merging that as well as part of the patch. @srdo
@zd-project I'm happy to take a look at implementing https://github.com/apache/storm/pull/2764 once the non-static PR is merged. I don't think it makes sense to mix more stuff into the current PR, and if it doesn't get approved, it'll be wasted work.
I am going to close this PR as #2764 was also closed after a list discussion. In addition, we are currently cleaning up old issues with stale discussions. The last comment here was made 5 years ago and Storm as evolved. If this is still an issue or relevant for your work, feel free to re-open and to rebase / fix the related PR.