opentelemetry-android icon indicating copy to clipboard operation
opentelemetry-android copied to clipboard

Monitoring relative cpu utilization for client spans

Open Doohl opened this issue 4 months ago • 4 comments

Current spans and events don't provide enough context to tell you, for example, why an AppStart or activity Created / Resumed span is taking a long time. Currently the only way you could reliably figure out why a span has elevated duration is by instrumenting additional sub-spans yourself.

In a very generic sense, any of your application's spans could be delayed for the following reasons:

  1. You have complex or deeply-nested view hierarchies that bog down the Android UI pipeline [Android App quality article].
    • In reality this would mostly affect draw times, but that's a separate issue altogether.
  2. There are expensive operations happening in the process / UI thread that are slowing things down. The easiest way to get a measure on this is to monitor CPU performance (through a CPU utilization gauge or metric).

We could benefit from a SpanProcessor that can sample relative, average CPU utilization for the duration of a span and append the following attributes:

  • process.cpu.avg_utilization - The relative, average CPU utilization for the app process during the span
  • process.cpu.elapsed_time_start_millis - The elapsed CPU time at the start of this span
  • process.cpu.elapsed_time_end_millis - The elapsed CPU time at the end of this span

How do we do this?

The calculation is very straightforward:

cpuUtilizationAvg = 100 * (spanElapsedCpuDuration / spanDuration) / numCores

Calculations, overall strategy, taken from a Lyft Engineering article on this subject.

We would also need to take advantage of the experimental ExtendedSpanProcessor introduced in an upstream PR (https://github.com/open-telemetry/opentelemetry-java/pull/6367) to append the new attributes before a span ends via the onEnding callback provided in the new SpanProcessor type.

Doohl avatar Aug 12 '25 06:08 Doohl

Prototype code (should give you an idea of how this can be implemented simply):

class CpuAttributesSpanProcessor : ExtendedSpanProcessor {
    companion object {
        const val CPU_AVERAGE_ATTRIBUTE = "process.cpu.avg_utilization"
        const val CPU_ELAPSED_TIME_START = "process.cpu.elapsed_time_start_millis"
        const val CPU_ELAPSED_TIME_END = "process.cpu.elapsed_time_end_millis"
    }

    private val numberOfCores = Runtime.getRuntime().availableProcessors()

    override fun onStart(
        parentContext: Context,
        span: ReadWriteSpan,
    ) {
        span.setAttribute(CPU_ELAPSED_TIME_START, android.os.Process.getElapsedCpuTime())
    }

    override fun isStartRequired(): Boolean = true

    override fun onEnd(span: ReadableSpan) {}

    override fun isEndRequired(): Boolean = true

    override fun isOnEndingRequired(): Boolean = true

    override fun onEnding(span: ReadWriteSpan) {
        val startCpuTime =
            span.getAttribute(AttributeKey.longKey(CPU_ELAPSED_TIME_START))
                ?: return
        val endCpuTime = android.os.Process.getElapsedCpuTime()
        val cpuTimeMs = (endCpuTime - startCpuTime).toDouble()
        val spanDurationMs = (span.latencyNanos / 1_000_000).toDouble()

        if (spanDurationMs > 0) {
            val cpuUtilization = (cpuTimeMs / spanDurationMs) * 100.0 / numberOfCores.toDouble()
            span.setAttribute(CPU_AVERAGE_ATTRIBUTE, cpuUtilization)
        }
        span.setAttribute(CPU_ELAPSED_TIME_END, endCpuTime)
    }
}

(Will PR a working prototype soon!)

Doohl avatar Aug 12 '25 07:08 Doohl

idea: Could android.os.Debug.threadCpuTimeNanos() give us a more reliable / accurate view on CPU time?

In Android SIG meeting today we discussed the possibility of android.os.Process.getElapsedCpuTime() not really providing an accurate measure of actual time spent on the CPU. The documentation implies that it could be a measure of overall process running time.

Doohl avatar Aug 12 '25 17:08 Doohl

Separate problem: Can a single span do work across multiple spans? If so, then CPU measure isn't really useful.

Doohl avatar Aug 12 '25 17:08 Doohl

After some research, I've determined that android.os.Process.getElapsedCpuTime() does return what it should: https://android.googlesource.com/platform/frameworks/base/+/donut-release/core/jni/android_util_Process.cpp#823

It uses the clock_gettime POSIX function using the CLOCK_PROCESS_CPUTIME_ID clock id, which returns a high-resolution per-process CPU time.

Doohl avatar Aug 14 '25 00:08 Doohl