proposal: testing: add Keep, to force evaluation in benchmarks
Benchmarks frequently need to prevent certain compiler optimizations that may optimize away parts of the code the programmer intends to benchmark. Usually, this comes up in two situations where the benchmark use of an API is slightly artificial compared to a “real” use of the API. The following example comes from @davecheney's 2013 blog post, How to write benchmarks in Go, and demonstrates both issues:
func BenchmarkFib10(b *testing.B) {
	// run the Fib function b.N times
	for n := 0; n < b.N; n++ {
		Fib(10)
	}
}
- Most commonly, the result of the function under test is not used because we only care about its timing. In the example, since Fib is a pure function, the compiler could optimize away the call completely. Indeed, in “real” code, the compiler would often be expected to do exactly this. But in benchmark code, we’re interested only in the side-effect of the function’s timing, which this optimization would destroy.
- An argument to the function under test may be unintentionally constant-folded into the function. In the example, even if we addressed the first issue, the compiler may compute Fib(10) entirely at compile time, again destroying the benchmark. This is more subtle because sometimes the intent is to benchmark a function with a particular constant-valued argument, and sometimes the constant argument is simply a placeholder.
There are ways around both of these, but they are difficult to use and tend to introduce overhead into the benchmark loop. For example, a common workaround is to add the result of the call to an accumulator. However, there’s not always a convenient accumulator type, this introduces some overhead into the loop, and the benchmark must then somehow ensure the accumulator itself doesn’t get optimized away.
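For illustration, the accumulator workaround typically looks something like this (a sketch; the package-level sink variable is the usual trick for keeping the accumulator itself from being optimized away):

var sink int // package-level sink, assumed to defeat dead-code elimination

func BenchmarkFib10(b *testing.B) {
	var acc int
	for n := 0; n < b.N; n++ {
		acc += Fib(10) // "use" the result so the call can't be discarded
	}
	sink = acc // publish the accumulator so it stays live
}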
In both cases, these optimizations can be partial, where part of the function under test is optimized away and part isn’t, as demonstrated in @eliben’s example. This is particularly subtle because it leads to timings that are incorrect but also not obviously wrong.
Proposal
I propose we add the following function to the testing package:
package testing
// Keep returns its argument. It ensures that its argument and result
// will be evaluated at run time and treated as non-constant.
// This is for use in benchmarks to prevent undesired compiler optimizations.
func Keep[T any](v T) T
(This proposal is an expanded and tweaked version of @randall77’s comment.)
The Keep function can be used on the result of a function under test, on arguments, or even on the function itself. Using Keep, the corrected version of the example would be:
func BenchmarkFib10(b *testing.B) {
	// run the Fib function b.N times
	for n := 0; n < b.N; n++ {
		testing.Keep(Fib(testing.Keep(10)))
	}
}
(Or testing.Keep(Fib)(10), but this is subtle enough that I don’t think we should recommend this usage.)
Unlike various other solutions, Keep also lets the benchmark author choose whether to treat an argument as constant or not, making it possible to benchmark expected constant folding.
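As a sketch of that choice (assuming the proposed testing.Keep), wrapping only the result leaves the argument eligible for constant folding, while wrapping the argument as well forces it to be treated as unknown at compile time:

// Benchmark Fib with the compiler still free to constant-fold the argument.
for n := 0; n < b.N; n++ {
	testing.Keep(Fib(10))
}

// Benchmark Fib with the argument treated as non-constant.
for n := 0; n < b.N; n++ {
	testing.Keep(Fib(testing.Keep(10)))
}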
Alternatives
- Keep may not be the best name. This is essentially equivalent to Rust’s black_box, and we could call it testing.BlackBox. Other options include Opaque, NoOpt, Used, and Sink.
- #27400 asks for documentation of best practices for avoiding unwanted optimization. While we could document workarounds, the basic problem is that Go doesn’t currently have a good answer for benchmarks that run afoul of compiler optimizations.
- #48768 proposes testing.Iterate, which forces evaluation of all arguments and results of a function, in addition to abstracting away the b.N loop, which is another common source of benchmarking mistakes. However, its heavy use of reflection would be difficult to make zero or even low overhead, and it lacks static type-safety. It also seems likely that users would often just pass a func() with the body of the benchmark, negating its benefits for argument and result evaluation.
- runtime.KeepAlive can be used to force evaluation of the result of a function under test. However, this isn’t the intended use, and it’s not clear how this might interact with future optimizations to KeepAlive. It also can’t be used for arguments because it doesn’t return anything. @cespare has some arguments against KeepAlive in this comment.
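For reference, a sketch of that runtime.KeepAlive workaround (it protects only the result, not the arguments, and relies on behavior that isn’t KeepAlive’s documented purpose):

func BenchmarkFib10(b *testing.B) {
	var r int
	for n := 0; n < b.N; n++ {
		r = Fib(10)
	}
	runtime.KeepAlive(r) // "uses" r so the calls can't be discarded
}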
Bikeshedding the name aside (Keep SGTM), I really like this proposal.
It's simple and sufficient. It doesn't prevent us from working on a different API like the proposed Iterate in the future.
If we planned to do Iterate, we might not want to also do Keep. That said, I think the drawbacks listed above for Iterate are quite serious and we should simply not do it.
It doesn't prevent us from working on a different API like the proposed Iterate in the future.
FWIW, I'm planning to file another proposal for an API that covers just the looping aspect of Iterate and would complement Keep.
Doing Iterate would cause a lot of churn as all the old benchmarks are rewritten to be in the new style. They would have different nesting depths, which make the diffs harder to ignore. Adding testing.Keep (or should it be b.KeepAlive()?) would cause minimal churn as just a bunch of globalSinks and runtime.KeepAlives would be replaced and only those specific lines would be affected.
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group
Unlike various other solutions, Keep also lets the benchmark author choose whether to treat an argument as constant or not, making it possible to benchmark expected constant folding.
Note that that is also possible with the API proposed in #48768 by closing over the intentionally-constant values:
func BenchmarkFibConstant10(b *testing.B) {
	b.Iterate(func() int {
		return Fib(10)
	})
}
It seems to me that this proposal and #48768 are equally expressive, and the key difference is just whether constant-propagation is opt-in (#48768) or opt-out (this proposal).
However, [Iterate's] heavy use of reflection would be difficult to make zero or even low overhead
As I have repeatedly stated on #48768, I believe that there are several viable ways to overcome that overhead.
I am becoming somewhat frustrated that https://github.com/golang/go/issues/48768#issuecomment-937003496 in particular seems to have been ignored. I may not be on the Go compiler team, but I am well acquainted with compiler optimization techniques, and so far nobody has explained why those techniques would not apply in this case.
and it lacks static type-safety.
While that is true, any type mismatch errors would be diagnosed immediately if the benchmark is ever actually run, and a similar lack of type safety was not seen as a significant barrier for the closely-related fuzz testing API (#44551).
@bcmills, my experience over >25 frustrating years of trying to benchmark things is that, in general, attempting to subtract out per-loop overhead sounds good in theory, but in practice that overhead can and often does include various random noise. And the more overhead there is, the louder the noise. This means if you are trying to benchmark a very short operation, then subtracting out a (different) reflect.Call measurement is very likely to destroy the measurement, perhaps even making it negative. The best approach we have for getting the most reliable numbers we can is to introduce as little overhead as possible to begin with.
For the trivial loop for i := 0; i < b.N; i++, we just ignore the overhead of the i++, i < N entirely and include it as part of the thing being measured. This turns out to be far more accurate than trying to subtract it out.
testing.Keep(Fib(testing.Keep(10)))
From what I can tell, this would require N+1 calls to Keep in order to benchmark a function with N arguments. Although N is usually fairly small, that still seems like a very noisy call site for even a modest number of arguments.
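For concreteness, under this proposal a call to a hypothetical two-argument function F would need three wrappers:

testing.Keep(F(testing.Keep(x), testing.Keep(y)))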
The main place where testing.Keep is needed is around the overall result. I write code to work around that all the time. It is far less common to need to worry about making the arguments opaque. I can't remember ever doing that.
I see now that you also mentioned making b.Iterate a compiler intrinsic. I suppose that is possible, but it seems very special-case. At that point it's basically a back-door language change, since either you can't do x := b.Iterate; x(Fib, 10) or it does something very different from b.Iterate(Fib, 10).
It is far less common to need to worry about making the arguments opaque. I can't remember ever doing that.
I expect that that will become more common as the compiler gets better at inlining. That said, it is also more straightforward to work around (without new API) today, such as by alternating among multiple entries in a slice of inputs.
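A sketch of that slice-of-inputs workaround (hypothetical names; the indexing is assumed to keep the argument from being treated as a compile-time constant):

var fibInputs = []int{10, 10, 10} // identical values; only their constness differs
var fibSink int                   // keep results live, as in the sink workaround

func BenchmarkFibOpaqueArg(b *testing.B) {
	var acc int
	for n := 0; n < b.N; n++ {
		acc += Fib(fibInputs[n%len(fibInputs)])
	}
	fibSink = acc
}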
I agree that subtracting out the overhead from a naive implementation based on reflect.Call does not seem viable.
Making b.Iterate itself a compiler intrinsic is one possible alternative, although I agree that the implication for Iterate as a method-value is unfortunate.
I think probably the most promising approach is an implementation that sets up a stack frame with arguments and then repeatedly invokes the function starting from that frame. It isn't obvious to me whether the reflect.Caller API in #49340 is sufficient for that or if it would need some other hook, but even in that case the hook could be provided in internal/reflectlite or a similar internal package.
The stack frame implementation would not be able to set up the arguments just once. It would have to set them up on every iteration, since in general a function can overwrite its arguments, and many do. reflect.Caller would amortize the allocation but not the setup.
All good points, @bcmills.
From what I can tell, this would require N+1 calls to Keep in order to benchmark a function with N arguments. Although N is usually fairly small, that still seems like a very noisy call site for even a modest number of arguments.
I'm not sure if you're referring to "line noise" here (and I agree this does introduce a fair amount of line noise) or measurement noise. For the latter, a naive implementation of testing.Keep will introduce just a CALL/RET pair, and we could easily optimize that away either by making the compiler recognize no-op functions, or by making it recognize this particular function. Intrinsifying this function seems more palatable than intrinsifying Iterate, though that's just my opinion.
Making b.Iterate itself a compiler intrinsic is one possible alternative, although I agree that the implication for Iterate as a method-value is unfortunate.
Another possible option is that we make sure b.Iterate can be inlined and then teach the compiler how to eliminate a reflect.Call of a statically-known function. That feels less "special" than teaching it about b.Iterate. I'm not sure this is a good option, though, since it would also have to figure out the loop that sets up the reflect.Value arguments, and have to deal with the type-checking that reflect would be doing at run-time.
I'm not that concerned about people capturing b.Iterate (or any alternative) as a method value.
We already do some code generation for tests. Is there anything we could code-generate to help with this? We don't rewrite any test code right now, so this might be pushing that too far.
The stack frame implementation would not be able to set up the arguments just once. It would have to set them up on every iteration, since in general a function can overwrite its arguments, and many do.
Not to mention, I would expect most or all of the arguments to be passed in registers. We would certainly have to re-initialize those.
What I had in mind is something like two functions: a (somewhat expensive) setup hook that checks types and copies function arguments from a slice into a more compact argument block, and a (cheap) “invoke” hook that initializes the call stack frame, copies arguments into registers, and calls the function.
The argument block might look something like:
+------------------------------------------------+
| pointer to GC map for argument block           |
+------------------------------------------------+
| function address                               |
+------------------------------------------------+
| closure address                                |
+------------------------------------------------+
| # of integer argument registers                |
+------------------------------------------------+
| # of FP argument registers                     |
+------------------------------------------------+
| spill + stack-assigned result area size        |
+------------------------------------------------+
| stack-assigned argument area size              |
+------------------------------------------------+
| integer register arguments                     |
| ...                                            |
+------------------------------------------------+
| FP register arguments                          |
| ...                                            |
+------------------------------------------------+
| stack-assigned arguments                       |
| ...                                            |
+------------------------------------------------+
The implementation of Iterate would be something like:
func (b *B) Iterate(f any, args ...any) {
	call := reflectlite.NewCall(f, args...)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		call.Invoke()
	}
}
where call.Invoke() is similar in cost to a non-inlined function call, but perhaps with a couple of extra jumps to deal with argument-register counts that aren't compile-time constants.
That seems like it might be easier than teaching the compiler to inline through reflect.Call proper, but still a lot less run-time overhead than reflect.Call. (And it still leaves the door open for the inliner to figure out call.Invoke without trying to reason its way through all of the type-checking code that precedes it.)
As a developer I would much prefer the compiler be taught not to optimize away calls within a _test.go file instead of me having to remember to write a bunch of wrapper calls. I didn't see that listed in the alternatives, so my apologies if that has been proposed previously.
So I naively want the compiler not to optimize away things in a benchmark... but also some amount of the optimization happening would in fact be part of what the compiler would do running the code in reality, and thus, part of what I want to benchmark. The trick is distinguishing between optimizing-away the benchmark and optimizing-away some of the work inside the benchmark, which would also be optimized-away outside of the benchmark.
Another alternative name for Keep is eval.
This function is not only useful for testing, but also for non-testing code.
I believe that another problem with testing.Iterate() would be with escape analysis. Right?
It would cause heap allocations when returning pointer types, so it might cause bad benchmarking results.
It would cause heap allocations when returning pointer types, so it might cause bad benchmarking results.
The existing ABI is such that if a function that returns a pointer to a new object is not inlined into the caller, the object to which it points must be heap-allocated. But that is part of the cost of calling the function; if you want to test how the function performs when it is fully inlined, you should benchmark an outer function that calls it in an inlineable way.
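To make that concrete, a sketch with a hypothetical constructor (names invented for illustration): benchmarking newPoint directly measures the heap allocation the ABI requires, while benchmarking a small wrapper that doesn't let the pointer escape measures the inlined path, where the object is assumed to be stack-allocated:

type Point struct{ x, y int }

func newPoint(x, y int) *Point { return &Point{x, y} }

// sumNewPoint is small enough to be inlined, and the pointer does not
// escape it, so after inlining the Point can live on the stack.
func sumNewPoint(x, y int) int {
	p := newPoint(x, y)
	return p.x + p.y
}

var pointSink int // package-level sink to keep the accumulator live

func BenchmarkNewPointInlined(b *testing.B) {
	var acc int
	for i := 0; i < b.N; i++ {
		acc += sumNewPoint(i, i)
	}
	pointSink = acc
}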
I am still concerned about the overhead of reflect in Iterate. We can't subtract it, and that means we can't reliably measure very short functions - which are the ones most likely to be affected by throwing away parts of the computation.
The compiler is going to be involved no matter what. What if it's more involved? Specifically, suppose we have a function that just does looping and takes func(), like
b.Loop(func() { Fib(10) })
or maybe
for range b.Loop {
	Fib(10)
}
and the compiler would recognize testing.B.Loop and apply Keep to every function and every argument in every call in that closure. We could still provide Keep separately and explain in the docs for Loop what the compiler is doing, in terms of Keep.
This would end up being like b.Iterate but (a) you get to write the actual calls, and (b) there is no reflect. On the other hand, the compiler does more work. But this is the Go compiler and the Go testing package; the compiler already knows about sync/atomic and math and other packages.
For that matter we could also recognize for i := 0; i < b.N; i++ { ... } and do the same to that loop body (it might still help to have something like Iterate or Loop though).
I've just filed #61515, which I consider closely related to and complementary to this proposal and the Iterate proposal. I suggest we keep discussions of trade-offs between Iterate, Keep, and Loop in this issue.
To quote #61515:
b.Loop could be a clear signal to the compiler not to perform certain optimizations in the loop body that often quietly invalidate benchmark results.
To be explicit, does that proposal include this compiler special-casing? Or is that proposal only for the API change and then in the future we will take a separate decision regarding special compilation of Loop?
You say that that proposal is complementary to this Keep proposal, but this specific change seems non-orthogonal. If we decide to do the compiler special-casing, that seems like it should bear on our decision about whether to expose Keep to the user at all.
We are considering Keep and Loop-with-implicit-Keep together. (And the place for that combined discussion is this issue, not #61515.)
If we do the special casing, then we basically have to expose Keep too, so that we can explain what the special casing does and provide a way for users to have that same power themselves.
To summarize the current state, the idea is to have Keep(x) return x but "hide" it from the compiler and disable throwing it away, so you can use Keep(f(Keep(x))) to both make sure f's result calculation is not optimized away and to keep the compiler from specializing an inlined copy of f to handle just x.
Then, over on #61515, we have a proposal to define b.Loop() that returns bool and is used like:
for b.Loop() { ... }
instead of
for i := 0; i < b.N; i++ { ... }
The nice thing about b.Loop is that the testing package can run code inside b.Loop to time groups of iterations separately, so that for example b.Loop could return true 10 times and see how long those iterations took, and then return true 100 more times and see how long those took, all without breaking the loop. This would remove the need to call a benchmark function more than once, and it would remove the need for b.ResetTimer - the only timing would be while the for loop is running. Setup and teardown would automatically not be counted.
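A minimal sketch of that batching idea (hypothetical code, not the actual testing implementation or either proposal's API): each call that returns true admits one iteration, iterations run in growing batches, and each batch is timed separately, so setup and teardown outside the loop are never inside a timed region.

import "time"

type loopTimer struct {
	iter    int           // iterations completed in the current batch
	batch   int           // size of the current batch
	start   time.Time     // when the current batch started
	elapsed time.Duration // time banked from finished batches
	iters   int           // iterations banked from finished batches
}

func (lt *loopTimer) loop() bool {
	if lt.batch == 0 { // first call: start the first batch
		lt.batch = 10
		lt.iter = 1
		lt.start = time.Now()
		return true
	}
	if lt.iter < lt.batch {
		lt.iter++
		return true
	}
	// The current batch just finished: bank its time, then grow or stop.
	lt.elapsed += time.Since(lt.start)
	lt.iters += lt.batch
	if lt.elapsed >= time.Second { // assumed benchtime target
		return false // the for lt.loop() { ... } loop exits here
	}
	lt.batch *= 10
	lt.iter = 1
	lt.start = time.Now()
	return true
}

The reported time per operation would then be lt.elapsed divided by lt.iters, with no ResetTimer needed.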
And then on top of that, the compiler would recognize a for loop around b.Loop() and edit any calls inside the { ... } loop body to insert Keep around the result of the call and each argument.
With all that, a working, accurate benchmark for, say, unicode.IsSpace, would be:
func BenchmarkIsSpace(b *testing.B) {
	setup() // no setup really needed here but in general...
	for b.Loop() {
		unicode.IsSpace('x')
	}
	teardown() // same...
}
When users learn the pattern of using b.Loop, their benchmarks are easier to write and report real numbers.
This would be rewritten by the compiler to:
func BenchmarkIsSpace(b *testing.B) {
	setup() // no setup really needed here but in general...
	for b.Loop() {
		testing.Keep(unicode.IsSpace(testing.Keep('x')))
	}
	teardown() // same...
}
It might be better to rename Keep to Use too, but for clarity I've written this comment with Keep.
So, if we wanted to benchmark inlining of unicode.IsSpace with the constant argument 'x', I guess that would be written as:
func BenchmarkIsSpaceInlinedConstant(b *testing.B) {
	setup() // no setup really needed here but in general...
	// Enable inlining by defining this outside of the b.Loop body.
	xIsSpace := func() bool {
		return unicode.IsSpace('x')
	}
	for b.Loop() {
		// Benchmark the call with the 'x' argument inlined.
		xIsSpace()
	}
	teardown() // same...
}
?
I'm still not real fond of the “b.Loop body is compiler magic” aspect of that approach, but I will admit that that's just an aesthetic preference, and at some point we're dealing with compiler magic no matter how we slice it. 🤷‍♂️
So, if we wanted to benchmark inlining of unicode.IsSpace with the constant argument 'x', I guess that would be written as:
I believe your example is right. It's awkward, but I think the only way to make it non-awkward is what we have today. Given that I'm pretty sure intentional constant propagation in benchmarks is extremely rare, it seems like the right balance to make the common intent (no constant propagation) the easy default, at the expense of making the rare intent awkward.
I'm still not real fond of the “b.Loop body is compiler magic” aspect of that approach
I'm a bit squeamish about this, too. But, it seems to me that there are the users who think about unintended optimization in benchmarks, and the users who don't. With a little compiler magic, we can just solve this problem for the users who don't think about it. And for the users who do think about it, hopefully they can also learn about the deoptimization effect of b.Loop. Also, there's no harm in continuing to do the sorts of "manual" deoptimization that people do today. My main concern is that refactoring of code within a b.Loop could have surprising effects on the result of a benchmark. I think attaching it to b.Loop is a lot more robust to refactoring than, say, deoptimizing the body of Benchmark functions, though.