accumulo Consider Adding Metrics to Track CPU and I/O Utilization for Better Performance Monitoring

Exposing the CPU and I/O utilization of the methods likely to consume the most of those resources could provide insight into potential performance improvements to Accumulo. Further, exposing such metrics at the table level could help identify tables with an inefficient schema, inefficient bulk import strategies, or iterators or aggregators that are resource hogs.

This issue is to capture the methods or threads that are most likely to have high CPU or high I/O utilization.

May 03 '19 17:05 kubina

@kubina This issue seems very vague, and sounds a bit more like something a general purpose Java profiler tool would be for, rather than something specific to Accumulo. Do you have a specific, actionable step in mind that an Accumulo developer who works on this issue could begin working on that is in scope of Accumulo? If there's not something specific for somebody to work on for this, this ticket seems likely to be passed over and just sit here idle for a long time.

May 06 '19 20:05 ctubbsii

> This issue seems very vague, and sounds a bit more like something a general purpose Java profiler tool would be for, rather than something specific to Accumulo.

@ctubbsii A Java profiler will not give the granularity (like at the table level) that is sought from the metrics. Also, I want to collect the metrics from a production system and I would not recommend using a profiler on such a system.

> Do you have a specific, actionable step in mind that an Accumulo developer who works on this issue could begin working on that is in scope of Accumulo?

@ctubbsii I am working on that aspect of the ticket, but I am also looking for input from the developers about which tserver methods are most likely to use the most CPU and I/O.

May 07 '19 15:05 kubina

Your best chance for feedback, if that's what you desire here, is to avoid passive voice declarative sentences, and instead ask specific and direct questions (possibly on the dev mailing list instead of here). Since there isn't anything concrete here for somebody to act upon, I'm not sure how useful this ticket is going to be.

May 07 '19 17:05 ctubbsii

@kubina Apologies if my previous comment came across poorly. I had intended it to be a friendly suggestion for how to word this issue to clarify what you were asking, but I realize now it probably didn't come across as very friendly. Sorry for that.

May 07 '19 23:05 ctubbsii

@ctubbsii No problem. Ed Coleman recommended I create this ticket, similar to his ticket for collecting the FATE metrics. I will also email the dev list for input.

May 08 '19 12:05 kubina

I have created two issues, #1133 and #1134, that are related and may be the first steps towards addressing some of this ticket.
While those issues will not resolve this issue, it seems to me that they can form a cleaner baseline for moving forward. I expect that, as this issue evolves, "sub-issues" will be created to address specific items, so for now I'm thinking this issue is capturing things at an "epic" level rather than as a pure task.

May 16 '19 16:05 EdColeman

#2305 updates Accumulo 2.1.0 to use Micrometer. There is an option to include JVM metrics, but I don't think it will include the information requested. It appears that something like this was previously requested of Micrometer, but the issue is still open.

Dec 21 '21 18:12 dlmarion

JVM metrics would not provide the detail that this is asking for. If I recall past discussions, there have been asks for metrics that can measure (at a high level):

CPU and I/O with and without encryption
CPU and I/O with an alternate compute resource (GPU?)
the impact of using a larger tserver or multiple tservers on a node

Dec 21 '21 19:12 EdColeman

But this issue is for:

> This issue is to capture the methods or threads that are most likely to have high CPU or high I/O utilization.

Dec 21 '21 19:12 dlmarion

The threads or methods used in the high-level use cases I mentioned would be the items to instrument with metrics so that the info could be derived in production. (Well, in theory.)

Dec 21 '21 19:12 EdColeman

You can use OS measurement tools to get I/O statistics for single drives, but how do you measure the I/O usage of a Java application from within the application? You could, in theory, get the CPU time of the thread at the beginning and end of a method and report the difference. I'm not sure how costly that would be.
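
The CPU-time idea could be sketched roughly as follows, using the standard `java.lang.management.ThreadMXBean`. This is only an illustration: `ThreadCpuTimer`, `Timed`, and `time` are hypothetical names, not anything that exists in Accumulo, and the sketch reports -1 when the JVM does not support or has disabled thread CPU timing.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.function.Supplier;

/** Hypothetical helper (not part of Accumulo): CPU time a task uses on the calling thread. */
final class ThreadCpuTimer {
  private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

  /** The task's result plus the CPU nanoseconds it consumed, or -1 if timing is unavailable. */
  static final class Timed<T> {
    final T result;
    final long cpuNanos;

    Timed(T result, long cpuNanos) {
      this.result = result;
      this.cpuNanos = cpuNanos;
    }
  }

  /** Reads the current thread's CPU time before and after the task and reports the difference. */
  static <T> Timed<T> time(Supplier<T> task) {
    boolean supported =
        THREADS.isCurrentThreadCpuTimeSupported() && THREADS.isThreadCpuTimeEnabled();
    long start = supported ? THREADS.getCurrentThreadCpuTime() : -1L;
    T result = task.get();
    long end = supported ? THREADS.getCurrentThreadCpuTime() : -1L;
    long used = (start < 0 || end < 0) ? -1L : end - start;
    return new Timed<>(result, used);
  }

  public static void main(String[] args) {
    // Stand-in workload; in a tserver this would be the method being instrumented.
    Timed<Long> timed = time(() -> {
      long sum = 0;
      for (int i = 0; i < 5_000_000; i++) {
        sum += i % 7;
      }
      return sum;
    });
    System.out.println("result=" + timed.result + " cpuNanos=" + timed.cpuNanos);
  }
}
```

The cost question remains open: each measurement adds two `getCurrentThreadCpuTime()` calls, so the overhead would need to be measured on the target platform before wrapping hot methods this way.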

Dec 21 '21 19:12 dlmarion

Regarding IO, in the HDFS case, you should be able to get metrics from Hadoop itself. Example: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cm_metrics_datanode.html

Dec 21 '21 19:12 dlmarion

It looks like RateLimitedInputStream wraps the FSDataInputStream at CacheableBlockFile.Reader. We could emit a metric there, in the RateLimitedInputStream methods, to indicate how long it takes to seek or read from the underlying HDFS file InputStream. This is an example of something that we could do, but I'm not sure how much value it would provide.
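
As a rough sketch of that kind of instrumentation, the wrapper below times reads on a plain `java.io.InputStream`. It is not Accumulo's RateLimitedInputStream: `TimedInputStream` and its accessor names are hypothetical, it uses only the JDK, and it accumulates totals in `LongAdder` counters where a real change would presumably report to the metrics system. Seek timing is not shown because `java.io.InputStream` has no seek method; it would be wrapped the same way on a seekable stream.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical wrapper (not Accumulo code): records time spent reading the wrapped stream. */
final class TimedInputStream extends FilterInputStream {
  private final LongAdder readNanos = new LongAdder();
  private final LongAdder bytesRead = new LongAdder();

  TimedInputStream(InputStream in) {
    super(in);
  }

  @Override
  public int read() throws IOException {
    long start = System.nanoTime();
    try {
      int b = in.read();
      if (b >= 0) {
        bytesRead.increment();
      }
      return b;
    } finally {
      // Wall-clock time, so it includes time blocked waiting on the underlying stream.
      readNanos.add(System.nanoTime() - start);
    }
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    long start = System.nanoTime();
    try {
      int n = in.read(buf, off, len);
      if (n > 0) {
        bytesRead.add(n);
      }
      return n;
    } finally {
      readNanos.add(System.nanoTime() - start);
    }
  }

  long totalReadNanos() {
    return readNanos.sum();
  }

  long totalBytesRead() {
    return bytesRead.sum();
  }

  public static void main(String[] args) throws IOException {
    // An in-memory stream stands in for the HDFS file stream.
    try (TimedInputStream in = new TimedInputStream(new ByteArrayInputStream(new byte[4096]))) {
      byte[] buf = new byte[512];
      while (in.read(buf) > 0) {
        // Drain the stream.
      }
      System.out.println("bytes=" + in.totalBytesRead() + " nanos=" + in.totalReadNanos());
    }
  }
}
```

Because the timer uses `System.nanoTime()`, it measures elapsed time rather than CPU time, which is arguably the more useful number for I/O since most of the cost is waiting.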

Dec 22 '21 18:12 dlmarion