
proposal: runtime: add per-goroutine CPU stats

Open asubiotto opened this issue 4 years ago • 45 comments

Per-process CPU stats can currently be obtained via third-party packages like https://github.com/elastic/gosigar. However, I believe certain types of applications need to be able to understand CPU usage at a finer granularity.

Example

At a high level in CockroachDB, whenever an application sends a query to the database, we spawn one or more goroutines to handle the request. If more queries are sent to the database, they each get an independent set of goroutines. Currently, we have no way of showing the database operator how much CPU is used per query. This is useful for operators who want to understand which queries are using more CPU, measure that against their expectations, and do things like cancel a query that's using too many resources (e.g. an accidental, overly intensive analytical query). If we had per-goroutine CPU stats, we could implement this by aggregating CPU usage across all goroutines that were spawned for that query.

Fundamentally, I think this is similar to bringing up a task manager when you feel like your computer is slow, figuring out which process on your computer is using more resources than expected, and killing that process.

Proposal

Add a function to the runtime package that does something like:

type CPUStats struct {
    User   time.Duration
    System time.Duration
    ...
}

// ReadGoroutineCPUStats writes the calling goroutine's CPU stats into
// stats.
func ReadGoroutineCPUStats(stats *CPUStats)
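
For illustration, a sketch of how we might aggregate these stats per query under this proposal (queryCPU and worker are made-up names; the API is the one sketched above):

var queryCPU struct {
  sync.Mutex
  user, system time.Duration
}

// worker runs one goroutine's share of a query and, when done, charges the
// goroutine's CPU time to the query-level totals.
func worker(wg *sync.WaitGroup) {
  defer wg.Done()
  // ... do this goroutine's share of the query's work ...

  var stats runtime.CPUStats
  runtime.ReadGoroutineCPUStats(&stats)

  queryCPU.Lock()
  queryCPU.user += stats.User
  queryCPU.system += stats.System
  queryCPU.Unlock()
}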

Alternatives

From a correctness standpoint, an alternative to offering these stats is to LockOSThread the active goroutine for exclusive thread access and then obtain coarser-grained, thread-level CPU usage by calling Getrusage for the current thread. The performance impact is unclear.
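
For concreteness, a minimal Linux-only sketch of that alternative (threadCPU is a made-up helper, not a proposed API); note it measures the thread, not the goroutine, and only holds while the goroutine keeps the thread locked:

package main

import (
  "fmt"
  "runtime"
  "time"

  "golang.org/x/sys/unix"
)

// threadCPU returns the user/system CPU time consumed so far by the calling
// thread (RUSAGE_THREAD is Linux-specific).
func threadCPU() (user, system time.Duration, err error) {
  var ru unix.Rusage
  if err := unix.Getrusage(unix.RUSAGE_THREAD, &ru); err != nil {
    return 0, 0, err
  }
  toDur := func(tv unix.Timeval) time.Duration {
    return time.Duration(tv.Sec)*time.Second + time.Duration(tv.Usec)*time.Microsecond
  }
  return toDur(ru.Utime), toDur(ru.Stime), nil
}

func main() {
  runtime.LockOSThread() // keep exclusive use of this thread for the measurement
  defer runtime.UnlockOSThread()

  u0, s0, _ := threadCPU()
  // ... the work we want to attribute to this goroutine ...
  u1, s1, _ := threadCPU()
  fmt.Printf("user=%v system=%v\n", u1-u0, s1-s0)
}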

Additional notes

Obtaining execution statistics during runtime at a fine-grained goroutine level is essential for an application like a database. I'd like to focus this conversation on CPU usage specifically, but the same idea applies to goroutine memory usage. We'd like to be able to tell how much live memory a single goroutine has allocated to be able to decide whether this goroutine should spill a memory-intensive computation to disk, for example. This is reminiscent of https://github.com/golang/go/issues/29696 but at a finer-grained level without a feedback mechanism.

I think that offering per-goroutine stats like this is useful even if it's just from an observability standpoint. Any application that divides work into independent sets of goroutines and would like to track resource usage of a single group should benefit.

asubiotto avatar Sep 22 '20 13:09 asubiotto

A possible solution to showing high level usage of different query paths can be achieved by setting profiling labels on the goroutine: https://golang.org/src/runtime/pprof/runtime.go?s=1362:1432#L26

And doing background profiling on the job: https://go-review.googlesource.com/c/go/+/102755

Overall Go program usage can be queried from the enclosing container or directly from the operating system's process stats.

martisch avatar Sep 22 '20 14:09 martisch

Yes, this is exactly what labels are for. A nice thing about labels is that they let you measure CPU or heap performance across a range of goroutines all cooperating on some shared task.

https://golang.org/pkg/runtime/pprof/#Labels https://golang.org/pkg/runtime/pprof/#Do

Please let us know if you need something that is not addressed by labels.
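
For reference, a minimal sketch of the labels approach (the "query" label key and the profile plumbing here are illustrative):

package main

import (
  "context"
  "log"
  "os"
  "runtime/pprof"
)

// handleQuery tags all CPU samples taken while this function (and any
// goroutines it starts via ctx) runs with the query's ID.
func handleQuery(ctx context.Context, queryID string) {
  pprof.Do(ctx, pprof.Labels("query", queryID), func(ctx context.Context) {
    runQuery(ctx) // samples show up attributed to query=<queryID>
  })
}

func runQuery(ctx context.Context) { /* ... */ }

func main() {
  f, err := os.Create("cpu.pprof")
  if err != nil {
    log.Fatal(err)
  }
  defer f.Close()
  if err := pprof.StartCPUProfile(f); err != nil {
    log.Fatal(err)
  }
  defer pprof.StopCPUProfile()

  handleQuery(context.Background(), "query-42")
}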

ianlancetaylor avatar Sep 22 '20 20:09 ianlancetaylor

Thanks for the suggestion. My main concern with profiling is that there is a non-negligible performance overhead. For example, running a quick workload (95% reads and 5% writes against a CockroachDB SQL table) shows that throughput drops by at least 8% when profiling with a one second interval.

I'm hoping that this information can be gathered by the scheduler in a much cheaper way since the question to answer is not "where has this goroutine spent most of its time" but "how much CPU time has this goroutine used". Would this even be feasible?

asubiotto avatar Sep 23 '20 11:09 asubiotto

Ah, I see. I would think that always collecting CPU statistics would be unreasonably expensive. But it does seem possible to collect them upon request in some way, at least when running on GNU/Linux. Every time a thread switched to a different goroutine it would call getrusage with RUSAGE_THREAD. The delta would be stored somewhere with the old goroutine. Then it could be retrieved as you suggest. Memory profiling information could be collected separately.

I don't know how useful this would be for most programs. In Go it is very easy to start a new goroutine, and it is very easy to ask an existing goroutine to do work on your behalf. That means that it's very easy for goroutine based stats to accidentally become very misleading, for example when the program forgets to collect the stats of some newly created goroutine. That is why runtime/pprof uses the labels mechanism.

Perhaps it would also be possible for this mechanism to use the labels mechanism. But then it is hard to see where the data should be stored or how it should be retrieved.

ianlancetaylor avatar Sep 24 '20 00:09 ianlancetaylor

And doing background profiling on the job: https://go-review.googlesource.com/c/go/+/102755

Overall Go program usage can be queried from the enclosing container or process stats from the Operating system directly.

I'm curious what the internal experience with this mechanism is and how it is used exactly. I suppose some goroutine will be tasked with continuously reading the background profiling stream, but what does it do with it? Is there any code I can look at for how to get from "here's a stream of timestamped label maps" to "this label used X CPU resources"? Do we "just" create a profile every X seconds from the stream? And if that is the idea, how is that different from doing "regular" profiling at a lower sample rate (you might miss a beat every time you restart the sample, but let's assume this is ok)? I feel like I'm missing the bigger picture here.

For the record, I did get the background tracing PR linked above working on the go1.16 branch: https://github.com/cockroachdb/cockroach/pull/60795

tbg avatar Feb 19 '21 16:02 tbg

Ah, I see. I would think that always collecting CPU statistics would be unreasonably expensive. But it does seem possible to collect them upon request in some way, at least when running on GNU/Linux. Every time a thread switched to a different goroutine it would call getrusage with RUSAGE_THREAD. The delta would be stored somewhere with the old goroutine. Then it could be retrieved as you suggest. Memory profiling information could be collected separately.

With the RUSAGE_THREAD idea, wouldn't we be adding a syscall per scheduler tick? That seems very expensive and would likely be a non-starter for use cases such as ours where we want to track usage essentially at all times for most of the goroutines.

The most lightweight variant I have seen is based on nanotime() per scheduler tick, as my colleague @knz prototyped here (the code just counts ticks, but imagine adding up nanotime() deltas instead of the ticks):

https://github.com/golang/go/compare/master...cockroachdb:taskgroup?expand=1

I understand that this basic approach has caveats (if threads get preempted, the duration of the preemption will be counted towards the goroutine's CPU time, and there's probably something about cgo calls too) but it seems workable and cheap enough to at least be able to opt into globally.
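
To make the bookkeeping concrete, here's the idea in heavily simplified, non-runtime form (names are illustrative; the real prototype is in the branch linked above):

// Per-goroutine bookkeeping the scheduler would maintain.
type gTiming struct {
  lastRunStart int64 // nanotime() when the goroutine last became running
  runningNanos int64 // accumulated observed running time
}

// Called when the scheduler starts/stops running the goroutine.
func startedRunning(g *gTiming, now int64) { g.lastRunStart = now }
func stoppedRunning(g *gTiming, now int64) { g.runningNanos += now - g.lastRunStart }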

I don't know how useful this would be for most programs. In Go it is very easy to start a new goroutine, and it is very easy to ask an existing goroutine to do work on your behalf. That means that it's very easy for goroutine based stats to accidentally become very misleading, for example when the program forgets to collect the stats of some newly created goroutine. That is why runtime/pprof uses the labels mechanism.

The PR above avoids this (if I understand you correctly) by using propagation rules very similar to labels. A small initial set of goroutines (ideally just one, at the top of the "logical task") is explicitly labelled with the task group (identified by, say, an int64) to which the app holds a handle (for querying). The ID is inherited by child goroutines. Statistics are accrued at the level of the task group, not at the individual goroutine level (though goroutine level is possible: just create unique task group IDs). I recall that the Go team is averse to accidentally giving users goroutine-local storage, which this approach would not do (there would not need to be a way to retrieve the current task group from a goroutine): given the task group ID, one can ask for its stats, but one cannot ask a goroutine for its task group ID.

Perhaps it would also be possible for this mechanism to use the labels mechanism. But then it is hard to see where the data should be stored or how it should be retrieved.

I agree that getting labels and either of the newly proposed mechanisms to play together would be nice, but it doesn't seem straightforward. You could assign a counter to each label pair (key -> value), i.e. two goroutines that both have label[k]=v share a task group <k,v>. If a goroutine with labels map[string]string{"l1": "foo", "l2": "bar"} ran for 10ms, we would accrue that time to both of its label pairs, i.e. conceptually somewhere in the runtime m[tup{"l1", "foo"}] += 10 // ms and similarly for l2. Perhaps the way to access these metrics would be via the runtime/metrics package.
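
In code, that conceptual bookkeeping would look something like this (illustrative only, not a concrete proposal for where it would live in the runtime):

type labelPair struct{ key, value string }

// cpuNanosByLabelPair is the conceptual "m" above, keyed by <k,v>.
var cpuNanosByLabelPair = map[labelPair]int64{}

// chargeLabels accrues d nanoseconds of running time to every label pair
// carried by the goroutine that just stopped running.
func chargeLabels(labels map[string]string, d int64) {
  for k, v := range labels {
    cpuNanosByLabelPair[labelPair{k, v}] += d
  }
}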

One difficulty is that we lose the simple identification of a task group with a memory address which we had before, because two maps with identical contents identify the same task group.

On the plus side, any labelled approach would avoid the clobbering of counters that could naively result if libraries did their own grouping (similar to how one should not call runtime.SetGoroutineLabels), as counters could be namespaced and/or nested.

You mention above that

A nice thing about labels is that they let you measure CPU or heap performance

but profiler labels are not used for heap profiles yet, and I think it's because of a somewhat similar problem - a goroutine's allocations in general outlive the goroutine, so the current mechanism (where the labels hang off the g) doesn't work and the lifecycle needs to be managed explicitly. (Everything I know about this I learned from https://github.com/golang/go/issues/23458#issuecomment-377973247). Maybe the way forward there opens a way forward for sane per-label-pair counters?

tbg avatar Feb 22 '21 15:02 tbg

The most lightweight variant I have seen is based on nanotime() per scheduler tick, as my colleague @knz prototyped here (the code just counts ticks, but imagine adding up nanotime() deltas instead of the ticks):

The implementation using nanotime() is in fact ready already here: https://github.com/cockroachdb/go/commit/b08903328e01603dc1a620779ca9a926f056023e

Perhaps the way to access these metrics would be via the runtime/metrics package.

Done here: https://github.com/cockroachdb/go/commit/9020a4bfcf68d37c5b4d88627357ee0cff29137c

You could assign a counter to each label pair (key -> value), i.e. two goroutines that both have label[k]=v share a task group <k,v>

One elephant in the room is that looking up labels and traversing a Go map in the allocation hot path is a performance killer.

A nice thing about labels is that they let you measure CPU or heap performance

but profiler labels are not used for heap profiles yet,

This is the other elephant in the room: partitioning the heap allocator by profiler labels would run afoul of the small heap optimization (plus, it would be CPU-expensive to group the profiling data by labels).

knz avatar Feb 22 '21 17:02 knz

In case it wasn't clear from the latest comments from @tbg and myself: we believe that doing things at the granularity of goroutines is too fine grained, and raises painful questions about where to accumulate the metrics when goroutines terminate.

While @tbg is trying to salvage pprof labels as the entity that defines a grouping of goroutines, I am promoting a separate "task group" abstraction which yields a simpler implementation and a lower run-time overhead. I don't know which one is going to win yet—we need to run further experiments—but in any case we don't want a solution that does stats per individual goroutine.

knz avatar Feb 22 '21 17:02 knz

@knz I really like the feature in https://github.com/cockroachdb/go/commits/crdb-fixes. Could you create a pull request to go official?

crazycs520 avatar May 13 '21 05:05 crazycs520

Thank you for your interest! We're still evaluating whether this approach is useful in practice. Once we have a good story to tell, we'll share it with the Go team.

knz avatar May 13 '21 14:05 knz

Change https://go.dev/cl/387874 mentions this issue: runtime,runtime/metrics: track on-cpu time per goroutine

gopherbot avatar Feb 24 '22 14:02 gopherbot

@irfansharif sent https://go.dev/cl/387874 today, which adds a /sched/goroutine/running:nanoseconds metric to runtime/metrics reporting the total time spent in _Grunning by the calling goroutine. This is effectively equivalent to the ReadGoroutineCPUStats API from the first comment, but hidden behind the runtime/metrics API.
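
For reference, reading that metric through runtime/metrics would look roughly like this (only meaningful under the patched runtime from the CL, since the metric doesn't exist upstream):

package main

import (
  "fmt"
  "runtime/metrics"
)

// goroutineRunningNanos reads the calling goroutine's accumulated _Grunning
// time via the metric added in the CL; it returns 0 if the metric is absent.
func goroutineRunningNanos() uint64 {
  s := []metrics.Sample{{Name: "/sched/goroutine/running:nanoseconds"}}
  metrics.Read(s)
  if s[0].Value.Kind() != metrics.KindUint64 {
    return 0 // metric not supported by this runtime
  }
  return s[0].Value.Uint64()
}

func main() {
  fmt.Println(goroutineRunningNanos(), "ns spent running so far")
}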

FWIW, I don't think this API is a good fit for runtime/metrics, which typically returns the same results regardless of the calling goroutine [1]. Given the other metrics available, like /sched/latencies:seconds, I expected this to be a histogram of time spent in running, which I think could be useful, but not for the resource isolation requirements in this proposal.

[1] We also expect some monitoring systems to discover and export all metrics. This metric would be meaningless to export directly, as it would only report on a single reporter goroutine.

cc @golang/runtime @mknyszek

prattmic avatar Feb 24 '22 16:02 prattmic

Thanks for the quick turnaround. I'm happy to go the route of ReadGoroutineCPUStats or a more direct GoroutineRunningNanos-like API; I only went with runtime/metrics given the flexibility of the API and because the direct APIs have commentary favoring the runtime/metrics variants. I would also be happy with adding just the private grunningnanos() helper for external dependents like CRDB to go:linkname directly against, but I can see why that's unsatisfying to include in the stdlib. If the nanos were just tracked in type g struct, that too would be altogether sufficient.
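
For context, the go:linkname route mentioned above would look roughly like this on the caller's side (assuming a patched runtime that defines grunningnanos; the package also needs an empty .s file so the compiler accepts the body-less declaration):

package cpunanos

import (
  "time"
  _ "unsafe" // for go:linkname
)

//go:linkname grunningnanos runtime.grunningnanos
func grunningnanos() int64

// RunningTime returns the calling goroutine's observed running time.
func RunningTime() time.Duration { return time.Duration(grunningnanos()) }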

We also expect some monitoring systems to discover and export all metrics. This metric would be meaningless to export directly, as it would only report on a single reporter goroutine.

Great point. When documenting that the metric is scoped only to the calling goroutine, I hoped that would be enough for monitoring systems to know to filter it out explicitly.

irfansharif avatar Feb 24 '22 17:02 irfansharif

On the CRDB side I've filed https://github.com/cockroachdb/cockroach/issues/77041, which I hope adds more color as to why we're eager to get more observability into runtime internals.

Unrelated but perhaps of interest to the golang/runtime folks, we also use the # of runnable goroutines as an indicator for CPU saturation. We use this to power our admission control machinery for stability + better tail latencies. For that metric, we were able to get what we needed by peeking into already-tracked runtime internals. For these running nanos, however, I don't think we had similar allowances, hence the patch.

https://github.com/cockroachdb/cockroach/blob/0679c4cb6578/pkg/util/goschedstats/runnable.go#L22-L34

https://github.com/cockroachdb/cockroach/blob/0679c4cb6578/pkg/util/admission/granter.go#L29-L37

irfansharif avatar Feb 25 '22 16:02 irfansharif

For the number of runnable goroutines, that would be quite reasonable to add to runtime/metrics. I actually thought we had that, but it looks like we only have the total number of goroutines. It would perhaps make sense to add something like the /memory/classes hierarchy, where we subdivide the total number of goroutines by state.

aclements avatar Feb 25 '22 19:02 aclements

For example, running a quick workload (95% reads and 5% writes against a CockroachDB SQL table) shows that throughput drops by at least 8% when profiling with a one second interval.

I find this incredibly surprising. I would expect the impact of profiling to be well under 1%. In fact, it's not obvious to me that statistical profiling would be more expensive than tracking time on every goroutine switch in an application that's heavy on goroutine scheduling.

If profiling didn't have this overhead, would it make sense to use profiling labels plus a stream of profile events to accumulate the information you need?

aclements avatar Feb 25 '22 19:02 aclements

If profiling didn't have this overhead, would it make sense to use profiling labels plus a stream of profile events to accumulate the information you need?

We attempted something similar in https://github.com/cockroachdb/cockroach/pull/60508. I can try re-running a stock workload with profiling enabled to measure what the overhead is, perhaps things have changed since that 2020 comment and/or we didn't measure correctly. I'll do it next week.

In fact, it's not obvious to me that statistical profiling would be more expensive than tracking time on every goroutine switch in an application that's heavy on goroutine scheduling.

For my own understanding:

  • wouldn't tracking time on goroutines be faster than profiling, since we'd avoid logging the entire stack trace? Logging stack traces also adds memory overhead.
  • wouldn't it be slower to compute the total on-CPU time for a trace by analyzing the stream of profile events instead of reading off of a counter?
  • does the overhead for https://go-review.googlesource.com/c/go/+/387874/ come primarily from the use of nanotime() on every goroutine switch? Or are we also worried about the two added fields to type g struct? I didn't know how to evaluate the latter. Given we already invoke nanotime() every 8th switch to track scheduler latencies, is this still a concern?

irfansharif avatar Feb 25 '22 21:02 irfansharif

To add to what Irfan has said, the inability to opt-in to per goroutine CPU tracking is a critical gap in the Go runtime for us. We're building a Serverless database where we charge users per request, and doing that well requires finer granularity of CPU tracking. Using the profiling functionality isn't a good match for this scenario, because we want a more precise measurement, not a statistical sample. Simple prepared queries can take just a few hundred microseconds to execute, which (as I understand it) is below the sampling granularity. In addition, charging usage based on samples is a blunt instrument.

While the details of Irfan's patch may not fit the Go runtime patterns well enough, can we get guidance on what would work better? As Irfan said, the current code calls nanotime every 8th goroutine switch. Even if the Go team is not comfortable with calling nanotime on every switch as in this patch, perhaps there could be an opt-in mechanism that enables apps to set gTrackingPeriod equal to 1 (and makes it a variable instead of a constant), so that only those apps pay the overhead (and are you sure calling nanotime here is actually a performance bottleneck to begin with?).

andy-kimball avatar Mar 17 '22 04:03 andy-kimball

For server applications that often have to actually measure this on individual requests for accounting/loadbalancing/etc. purposes, it's good to know this programmatically. As I recall, Google has this as a class in C++ that uses the Linux performance counters support to give per-thread/fiber user/system CPU time used between specified points in a thread/fiber's execution. Can this approach be adopted for Golang, as many companies, like Cockroach, are running into this in larger-scale systems?

g-talbot avatar Apr 01 '22 15:04 g-talbot

One other comment I'll make about profiling is that, unless you're using some sort of reservoir-sampling approach, profiling will give you a great picture of the common case but miss important edge cases, because the common cases dominate the sampling. It's not unusual to need to find events out in the 3 or 4 nines (or more) that have vastly different performance characteristics and never get sampled, but that, when they happen, cause the rest of the system to "hiccup" and have an inordinate effect. Having a way to directly measure goroutines helps with that too.

g-talbot avatar Apr 01 '22 15:04 g-talbot

I also wanted to get the temperature of the Go team around putting some Linux perf events around the scheduler. In particular, a few days ago this user events patch was merged into the kernel, allowing user-space programs to register perf events that can be dynamically enabled and disabled from outside the program through the tracefs file system. The program can check whether a particular event is enabled and, if it is, output trace data by writing to a tracefs file. The events can then be consumed through standard interfaces (perf, BPF, etc). In theory, runtime scheduler events output to perf can be joined with kernel scheduler events, and true on-CPU time can be computed for goroutines. I don't know the cost of producing the respective events when enabled (the cost is basically zero when disabled).

Irrespective of other decisions for adding Go native APIs for introspection discussed in this proposal, would a patch exploring the perf path be welcomed?

andreimatei avatar Apr 01 '22 15:04 andreimatei

@andreimatei I think supporting something like that would be of great interest to us.

The devil is in the details, however. Like, it would be great to just dump the execution trace (runtime/trace) events into a kernel trace and be able to correlate everything together. But how exactly would that be enabled? Maybe if a GODEBUG setting is present, StartTrace actually starts a kernel trace instead? I'm honestly not sure how well that fits with how user tracepoints are modeled. I'm interested in hearing and/or seeing more details about what you had in mind.

mknyszek avatar Apr 14 '22 18:04 mknyszek

Maybe a if a GODEBUG setting is present, StartTrace actually starts a kernel trace instead?

@mknyszek what I had in mind was to add some type of userspace tracepoints into the scheduler, making the scheduler events, as well as various arguments, available to tracers such as perf. For the use cases discussed in this issue, I imagine that the profiler labels of the goroutines involved would be good arguments to include in the tracepoints, as they can be used by the Go program to attach arbitrary information to its goroutines. Then, one could use standard Linux techniques (perf record, perf script, etc.) to collect the data at runtime, analyze it, and profit. Depending on exactly how we do it, BPF programs could also be attached to these tracepoints (courtesy of the Linux tracing infrastructure), which opens the door to collecting more arbitrary information from the running program. I would try to not have anything controlled by a GODEBUG, but instead have the probes always available for a tracer to attach to.

As you perhaps suggest, demonstrating that an execution trace equivalent to runtime/trace can be generated purely based on these tracepoints, and additionally that it can contain information invisible to the runtime around the kernel thread scheduling, would be very cool.

There are different details to work out. For example, what kind of tracepoints we would use. The new user events type that I mentioned in my previous message seems to have quickly fallen out of favor in the kernel; I think they might end up being removed. But there are other types. In particular, there's DTrace USDTs, which are supported by the various Linux tools. These also have a feature (USDT semaphores) that makes them practically zero cost when tracing is not enabled. Except USDTs are not natively available to Go programs, I believe, or even to the Go runtime (as you need the compiler and linker to cooperate). There are tricks available out there, but I don't know how suitable they are for the runtime and whether they preserve the zero-cost property. (Btw, I think that supporting USDTs would generally be a major improvement to Go.) In theory, for a fixed version of the runtime, we could also use dynamic uprobes to extract the information we want. These don't need any changes to the runtime, but reading the arguments from memory is tedious. However, a simple noinline dummy function that promises a somewhat stable interface and groups the arguments together for the sole purpose of being easily readable by a probe attached to that function could be workable.

andreimatei avatar Apr 27 '22 18:04 andreimatei

I can try re-running a stock workload with profiling enabled to measure what the overhead is, perhaps things have changed since that 2020 comment and/or we didn't measure correctly. I'll do it next week.

Apologies for being slow. The profiling overhead is not as stark as @asubiotto (hi Alfonso!) originally described, so perhaps we did measure things incorrectly.

would it make sense to use profiling labels plus a stream of profile events to accumulate the information you need?

In addition to @andy-kimball's point around wanting to measure at a granularity finer than 100Hz/10ms for microsecond accounting, another reason to not rely on the profile approach is its inaccuracy (profile event proportions for stack traces don't match actual on-CPU proportions) and imprecision (repeated measurements have a lot of variability) that we'd inherit if looking to build control loops on top of it; I'm borrowing the verbiage and experimental results from https://github.com/golang/go/issues/36821. I was also curious to see how measured CPU time would fare with tests in that proposal, so I tried the following two things (available at https://github.com/irfansharif/runner):

// Preamble added here for self-containedness; the linked repo has the real
// one. grunningnanos() is assumed to read the calling goroutine's running
// time from the patched runtime (e.g. via go:linkname, as sketched earlier).
package runner

import (
  "sync"
  "sync/atomic"
  "testing"

  "github.com/stretchr/testify/assert"
)

// TestEquivalentGoroutines is a variant of the "parallel test" in
// https://github.com/golang/go/issues/36821. It tests whether goroutines that
// (should) spend the same amount of time on-CPU have similar measured on-CPU
// time.
func TestEquivalentGoroutines(t *testing.T) {
  mu := struct {
    sync.Mutex
    nanos map[int]int64
  }{}
  mu.nanos = make(map[int]int64)

  f := func(wg *sync.WaitGroup, id int) {
    defer wg.Done()

    var sum int
    for i := 0; i < 500000000; i++ {
      sum -= i / 2
      sum *= i
      sum /= i/3 + 1
      sum -= i / 4
    }

    nanos := grunningnanos()
    mu.Lock()
    mu.nanos[id] = nanos
    mu.Unlock()
  }

  var wg sync.WaitGroup
  for i := 0; i < 10; i++ {
    i := i // copy loop variable
    wg.Add(1)
    go f(&wg, i)
  }
  wg.Wait()

  mu.Lock()
  defer mu.Unlock()

  total := int64(0)
  for _, nanos := range mu.nanos {
    total += nanos
  }

  minexp, maxexp := float64(0.085), float64(0.115)
  for i, nanos := range mu.nanos {
    got := float64(nanos) / float64(total)

    assert.Greaterf(t, got, minexp,
      "expected proportion > %f, got %f", minexp, got)
    assert.Lessf(t, got, maxexp,
      "expected proportion < %f, got %f", maxexp, got)

    t.Logf("%d's got %0.2f%% of total time", i, got*100)
  }
}


// TestProportionalGoroutines is a variant of the "serial test" in
// https://github.com/golang/go/issues/36821. It tests whether goroutines that
// (should) spend a proportion of time on-CPU have proportionate measured on-CPU
// time.
func TestProportionalGoroutines(t *testing.T) {
  f := func(wg *sync.WaitGroup, v uint64, trip uint64, result *int64) {
    defer wg.Done()

    ret := v
    for i := trip; i > 0; i-- {
      ret += i
      ret = ret ^ (i + 0xcafebabe)
    }

    nanos := grunningnanos()
    atomic.AddInt64(result, nanos)
  }

  results := make([]int64, 10, 10)
  var wg sync.WaitGroup

  for iters := 0; iters < 10000; iters++ {
    for i := uint64(0); i < 10; i++ {
      i := i // copy loop variable
      wg.Add(1)
      go f(&wg, i+1, (i+1)*100000, &results[i])
    }
  }

  wg.Wait()

  total := int64(0)
  for _, result := range results {
    total += result
  }

  initial := float64(results[0]) / float64(total)
  maxdelta := float64(0.5)
  for i, result := range results {
    got := float64(result) / float64(total)
    mult := got / initial
    assert.InDelta(t, float64(i+1), mult, maxdelta)

    t.Logf("%d's got %0.2f%% of total time (%fx)", i, got*100, mult)
  }
}

The results suggest that the measured on-CPU time is both accurate and precise:

=== RUN   TestEquivalentGoroutines
    0's got  9.98% of total time
    1's got  9.53% of total time
    2's got  9.22% of total time
    3's got 10.42% of total time
    4's got  9.84% of total time
    5's got 10.43% of total time
    6's got 10.50% of total time
    7's got 10.21% of total time
    8's got 10.03% of total time
    9's got  9.86% of total time

=== RUN   TestProportionalGoroutines
    0's got  1.87% of total time (1.000000x)
    1's got  3.60% of total time (1.931999x)
    2's got  5.41% of total time (2.899312x)
    3's got  7.21% of total time (3.864451x)
    4's got  9.11% of total time (4.880925x)
    5's got 10.94% of total time (5.864723x)
    6's got 12.77% of total time (6.842004x)
    7's got 14.34% of total time (7.685840x)
    8's got 16.58% of total time (8.885060x)
    9's got 18.18% of total time (9.741030x)

the inability to opt-in to per goroutine CPU tracking is a critical gap in the Go runtime for us.

does the overhead for https://go-review.googlesource.com/c/go/+/387874/ come primarily from the use of nanotime() on every goroutine switch?

How do the package maintainers feel about making https://github.com/golang/go/pull/51347 opt-in with zero overhead if opted out? ~~We could, e.g., only track the nanoseconds if a GODEBUG var is set, relying on branch prediction to get it right ~always + never invoke the additional nanotime() if unset.~~ EDIT: Actually, opting in/out through GODEBUG wouldn't work, as we want CRDB binaries to always opt into this tracking without needing the envvar to be set. The zero-overhead approach would still work if we introduced an API to switch this tracking on/off (but that of course is not a net-zero API change).

irfansharif avatar May 13 '22 00:05 irfansharif

The report in #36821 reflects bugs in / shortcomings of Go's CPU profiling on Linux as of early 2020. @irfansharif , have you found that it is still inaccurate on Linux as of Go 1.18, or is it possible that the runtime/pprof sampling profiler could give acceptable results for on-CPU time?

Some of the earlier comments here pointed to map-based goroutine labels adding unacceptable computation overhead. If those were more efficient, would goroutine labels plus the sampling profiler work well enough?

There's also the question from @andy-kimball of granularity, in particular for billing customers for their CPU time. I don't fully understand why sampling isn't an option, even for (repeated) operations that take only a few hundred microseconds each. It seems like the work of repeated operations would show up in over 1000 samples (at 100 Hz) well before it's consumed $0.01 of CPU time (at current cloud prices).
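
(For concreteness, assuming on the order of $0.05 per vCPU-hour: 1000 samples at 100 Hz correspond to 10 CPU-seconds, roughly $0.00014, while $0.01 buys about 720 CPU-seconds, i.e. around 72,000 samples.)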

Overall it seems to me that the current tools we have are close, so I'm interested in how we can improve them enough to be useful here.

rhysh avatar May 13 '22 17:05 rhysh

We would like to get to the point where a user can run an EXPLAIN ANALYZE and get back a cost for that single statement. This should work on a dev/staging cluster (i.e. with little or no other load), and give results that are similar to a production cluster that is fully loaded. Sampling would not work well for that.

Similarly, we'd like to be able to show a list of all recent SQL statements that have been run and show the CPU consumed by each. While some statements may have thousands of iterations and show an accurate number, there's often a "long tail" of other SQL statements that have only a few iterations. Those would often show 0, since we didn't happen to get a "hit". While it's perhaps better than nothing, we're trying to enable a better customer experience than that.

andy-kimball avatar May 14 '22 02:05 andy-kimball

Collecting a timestamp every time the scheduler starts or stops a goroutine, tracking goroutines' running time to sub-microsecond levels, sounds close to key features of the execution tracer (runtime/trace). That data stream also describes when goroutines interact with each other, which allows tracking which parts of the work each goroutine does are for particular inbound requests.

Rather than build up additional ways to see data that's already available via runtime/trace, are there ways to pare down or modify the execution tracer's view to something with lower complexity and overhead? I have guesses (below), but I'm curious for your view on how the execution tracer falls short today for your needs.

Is it:

  • CPU overhead (from collecting timestamps, from collecting stack traces, writing out the byte stream, or some other part) is too high?
  • Expensive startup (does work proportional to the number of living goroutines)?
  • Runtime does not flush partial results in a timely manner?
  • Inconvenient to parse?

rhysh avatar May 16 '22 19:05 rhysh

@rhysh I am currently having similar thoughts, and I think a low-enough overhead, user-parseable trace (with some way of inserting markers at the user level) sounds like it could potentially resolve this situation, assuming it's OK to stream this data out and analyze it programmatically out-of-band.
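
As a point of reference for "markers at the user level", runtime/trace already has tasks and regions; a minimal sketch (names and output file are illustrative):

package main

import (
  "context"
  "log"
  "os"
  "runtime/trace"
)

// handleQuery groups this request's work under one trace task; regions (and
// any child goroutines using the same ctx) show up attached to it in the
// execution trace.
func handleQuery(ctx context.Context) {
  ctx, task := trace.NewTask(ctx, "query")
  defer task.End()

  defer trace.StartRegion(ctx, "execute").End()
  // ... per-query work ...
}

func main() {
  f, err := os.Create("trace.out")
  if err != nil {
    log.Fatal(err)
  }
  defer f.Close()
  if err := trace.Start(f); err != nil {
    log.Fatal(err)
  }
  defer trace.Stop()

  handleQuery(context.Background())
}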

For this to work, though, the trace format itself has to be more scalable and robust, too. On scalability, the traces come out to be pretty large, and they unfortunately basically need to be read fully into memory. On robustness, you really always need a perfect trace to get any useful information out (or at least, the tools back out if it's not perfect). For instance, partial traces don't exist as a concept because there's no enforcement that, say, a block of really old events actually appears early in the trace, forcing you to always deal with the whole trace (from StartTrace to StopTrace).

I'm currently thinking a bit about tracing going forward, and this has been on my mind. No promises of course, but Go's tracing story in general could use some improvements.

mknyszek avatar May 16 '22 19:05 mknyszek

What would it take to get this proposal moved into the Active column of the Proposals board?

ajwerner avatar Jun 02 '22 22:06 ajwerner

Anyone interested in tracing, I can't recommend highly enough Dick Sites's new book Understanding Software Dynamics.

For an intro see his Stanford talk.

It seems to me that really good tracing along these lines would help a lot more than scattered per-goroutine CPU stats.

rsc avatar Jun 22 '22 18:06 rsc