Monitoring relative cpu utilization for client spans
Current spans and events don't provide enough context to tell you, for example, why an AppStart or activity Created / Resumed span is taking a long time. Currently the only way you could reliably figure out why a span has elevated duration is by instrumenting additional sub-spans yourself.
In a very generic sense, any of your application's spans could be delayed for the following reasons:
- You have complex or deeply-nested view hierarchies that bog down the Android UI pipeline [Android App quality article].
- In reality this would mostly affect draw times, but that's a separate issue altogether.
- There are expensive operations happening in the process / UI thread that are slowing things down. The easiest way to get a measure on this is to monitor CPU performance (through a
CPU utilizationgauge or metric).
We could benefit from a SpanProcessor that can sample relative, average CPU utilization for the duration of a span and append the following attributes:
process.cpu.avg_utilization- The relative, average CPU utilization for the app process during the spanprocess.cpu.elapsed_time_start_millis- The elapsed CPU time at the start of this spanprocess.cpu.elapsed_time_end_millis- The elapsed CPU time at the end of this span
How do we do this?
The calculation is very straightforward:
cpuUtilizationAvg = 100 * (spanElapsedCpuDuration / spanDuration) / numCores
Calculations, overall strategy, taken from a Lyft Engineering article on this subject.
We would also need to take advantage of the experimental ExtendedSpanProcessor introduced in an upstream PR (https://github.com/open-telemetry/opentelemetry-java/pull/6367) to append the new attributes before a span ends via the onEnding callback provided in the new SpanProcessor type.
Prototype code (should give you an idea of how this can be implemented simply):
class CpuAttributesSpanProcessor : ExtendedSpanProcessor {
companion object {
const val CPU_AVERAGE_ATTRIBUTE = "process.cpu.avg_utilization"
const val CPU_ELAPSED_TIME_START = "process.cpu.elapsed_time_start_millis"
const val CPU_ELAPSED_TIME_END = "process.cpu.elapsed_time_end_millis"
}
private val numberOfCores = Runtime.getRuntime().availableProcessors()
override fun onStart(
parentContext: Context,
span: ReadWriteSpan,
) {
span.setAttribute(CPU_ELAPSED_TIME_START, android.os.Process.getElapsedCpuTime())
}
override fun isStartRequired(): Boolean = true
override fun onEnd(span: ReadableSpan) {}
override fun isEndRequired(): Boolean = true
override fun isOnEndingRequired(): Boolean = true
override fun onEnding(span: ReadWriteSpan) {
val startCpuTime =
span.getAttribute(AttributeKey.longKey(CPU_ELAPSED_TIME_START))
?: return
val endCpuTime = android.os.Process.getElapsedCpuTime()
val cpuTimeMs = (endCpuTime - startCpuTime).toDouble()
val spanDurationMs = (span.latencyNanos / 1_000_000).toDouble()
if (spanDurationMs > 0) {
val cpuUtilization = (cpuTimeMs / spanDurationMs) * 100.0 / numberOfCores.toDouble()
span.setAttribute(CPU_AVERAGE_ATTRIBUTE, cpuUtilization)
}
span.setAttribute(CPU_ELAPSED_TIME_END, endCpuTime)
}
}
(Will PR a working prototype soon!)
idea: Could android.os.Debug.threadCpuTimeNanos() give us a more reliable / accurate view on CPU time?
In Android SIG meeting today we discussed the possibility of android.os.Process.getElapsedCpuTime() not really providing an accurate measure of actual time spent on the CPU. The documentation implies that it could be a measure of overall process running time.
Separate problem: Can a single span do work across multiple spans? If so, then CPU measure isn't really useful.
After some research, I've determined that android.os.Process.getElapsedCpuTime() does return what it should: https://android.googlesource.com/platform/frameworks/base/+/donut-release/core/jni/android_util_Process.cpp#823
It uses the clock_gettime POSIX function using the CLOCK_PROCESS_CPUTIME_ID clock id, which returns a high-resolution per-process CPU time.