Consider adding critical thread metrics for monitoring

Open EdColeman opened this issue 6 years ago • 5 comments

Exposing metrics for critical process threads could improve monitoring and provide additional insight for performance trending.

For example, certain threads in the master and tserver processes need to run periodically. If they do not, the process is likely unhealthy or having issues. Exposing the fact that the threads are running, progressing, or have successfully completed a task would improve monitoring capabilities. Additionally, if the run time were provided, it could be used to gauge relative health by trending performance over time and across upgrades.
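As a rough sketch of the idea (not existing Accumulo code), a periodic task could be wrapped so that each run records whether it completed and how long it took. The class name `TimedTask` and its accessors are hypothetical, and how the values would be published to a metrics system is left out.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical wrapper, not part of Accumulo: records liveness and run time of a periodic task. */
final class TimedTask implements Runnable {
  private final Runnable delegate;
  private final AtomicLong successCount = new AtomicLong();
  private final AtomicLong failureCount = new AtomicLong();
  // -1 means "no value recorded yet"
  private final AtomicLong lastSuccessMillis = new AtomicLong(-1);
  private final AtomicLong lastDurationNanos = new AtomicLong(-1);

  TimedTask(Runnable delegate) {
    this.delegate = delegate;
  }

  @Override
  public void run() {
    long start = System.nanoTime();
    try {
      delegate.run();
      successCount.incrementAndGet();
      lastSuccessMillis.set(System.currentTimeMillis());
    } catch (RuntimeException e) {
      failureCount.incrementAndGet();
      throw e; // let the caller or executor decide how to handle the failure
    } finally {
      // Duration is recorded for both successful and failed runs.
      lastDurationNanos.set(System.nanoTime() - start);
    }
  }

  long successCount() {
    return successCount.get();
  }

  long failureCount() {
    return failureCount.get();
  }

  long lastSuccessMillis() {
    return lastSuccessMillis.get();
  }

  long lastDurationNanos() {
    return lastDurationNanos.get();
  }
}
```

One thing to keep in mind with this approach: if the wrapped task is scheduled with a `ScheduledExecutorService`, an exception that escapes `run()` suppresses later executions, so the failure count and a stale last-success time would be the only visible signs.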

This issue is to capture possible candidate threads / processes that would be beneficial to incorporate into metrics reporting.

EdColeman avatar Feb 08 '19 23:02 EdColeman

The tablet server has a thread that continually checks whether tablets need to be compacted or split.

https://github.com/apache/accumulo/blob/c7a54c80b937fd606ebdd6b3672f837f65b6258f/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L2175

The master has three really important threads that assign the root, metadata, and user tablets. If these threads are not running, tablets will not be assigned. These threads are all run by TabletGroupWatcher.

https://github.com/apache/accumulo/blob/050dec2003a786ea014c994a38e180a82b997c0d/server/master/src/main/java/org/apache/accumulo/master/TabletGroupWatcher.java#L136

There is currently code that watches compactions and logs a warning if a compaction has not read or written any data in a certain amount of time. In addition to logging a warning, it might be nice to increment a stuck compaction counter, which could be decremented when the compaction becomes unstuck.

https://github.com/apache/accumulo/blob/2b9c9275ea5f992cfa2bd1a7e3f8994a41e69df3/server/tserver/src/main/java/org/apache/accumulo/tserver/tablet/CompactionWatcher.java#L31
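A minimal sketch of such a counter, assuming each compaction can be identified by some key: tracking the set of currently stuck items, instead of a bare integer, keeps the value correct when the watcher reports the same stuck compaction on repeated passes. `StuckGauge` and its methods are hypothetical names, not part of CompactionWatcher.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical gauge, not Accumulo code: counts items currently considered stuck. */
final class StuckGauge<K> {
  // Thread-safe set of identifiers for items that are stuck right now.
  private final Set<K> stuck = ConcurrentHashMap.newKeySet();

  /** Returns true if the item was not already marked stuck. */
  boolean markStuck(K id) {
    return stuck.add(id);
  }

  /** Returns true if the item had been marked stuck and is now cleared. */
  boolean markUnstuck(K id) {
    return stuck.remove(id);
  }

  /** Current number of stuck items; this is the value a metric would report. */
  int value() {
    return stuck.size();
  }
}
```

The same shape would work for stuck tablet loads, keyed by whatever identifies the tablet.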

There is also code that looks for stuck tablet loads and logs a warning. It would also be nice to have a counter for stuck loads.

https://github.com/apache/accumulo/blob/b915947c0c22d9db717067b601c62829205d1505/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServerResourceManager.java#L437

keith-turner avatar Feb 11 '19 17:02 keith-turner

PR https://github.com/apache/accumulo/pull/1379 added improved metadata consistency checking, which may be another candidate for improved reporting. I initially considered adding it to the gc metrics improvements, but decided that it needs a more comprehensive look and additional testing, which makes it better suited as a separate change.

EdColeman avatar Feb 03 '20 18:02 EdColeman

#2524 added the monitoring of critical background threads, throwing an Error in the event that a critical background thread terminated abnormally. @EdColeman - do you think #2524 and the other merged PRs linked here are sufficient to close this issue?

dlmarion avatar Mar 07 '22 18:03 dlmarion

This was originally proposed as metrics / monitoring at a level such that operators and app developers could gain insight into overall health and trends. Having the threads throw exceptions is great. But this was directed more at allowing monitoring and trending of higher-level functions - things that could be using multiple threads. @keith-turner provided some concrete examples. Knowing that the expected threads in the TabletGroupWatcher are running, and possibly timing how long each run takes, would allow metrics-based alerting and trending.

This is speculation and more of a description of something desired than a concrete example that I know happens. But assume that the thread handling user tablet assignments gets stuck or dies - if the manager keeps running, then that is going to be noticed eventually through secondary effects. Maybe FATE operations for table creates hang and back up or fail? Or splits start failing? Exposing that function as a reportable metric could allow intervention sooner - or it could be trended, and if the thread starts taking longer and longer to run, one could look at what has changed and fix something before it falls over.
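To make the alerting side a bit more concrete, here is a sketch of how a monitor could classify a periodic function from two reported values: the time of its last successful run and the duration of that run. Everything here is hypothetical (the `TaskHealth` name, the status values, and the thresholds), and a real baseline would have to come from observed history.

```java
/** Hypothetical health check; names and thresholds are illustrative, not Accumulo APIs. */
final class TaskHealth {
  enum Status { OK, SLOW, STALE }

  static Status evaluate(long nowMillis, long lastSuccessMillis, long maxAgeMillis,
      long lastDurationMillis, long baselineDurationMillis, double slowFactor) {
    // No successful run within the allowed window: the thread may be stuck or dead.
    if (lastSuccessMillis < 0 || nowMillis - lastSuccessMillis > maxAgeMillis) {
      return Status.STALE;
    }
    // Still running, but the last run took much longer than the baseline.
    if (baselineDurationMillis > 0
        && lastDurationMillis > baselineDurationMillis * slowFactor) {
      return Status.SLOW;
    }
    return Status.OK;
  }
}
```

For example, with a 10 second maximum age and a slow factor of 2.0, a run that finished 5 seconds ago and took 900 ms against a 200 ms baseline would be classified as `SLOW`, while no successful run in the last 20 seconds would be `STALE`.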

EdColeman avatar Mar 07 '22 19:03 EdColeman

I thought a good place to do one of these was at the new consistency check thread. See https://github.com/apache/accumulo/pull/2583

milleruntime avatar Mar 22 '22 18:03 milleruntime