go icon indicating copy to clipboard operation
go copied to clipboard

unicode/utf8: make DecodeRune and DecodeLastRune inlineable

Open dsnet opened this issue 4 years ago • 22 comments

Code like the following is unfortunately very common:

r, n := rune(b[0]), 1
if r >= utf8.RuneSelf {
    r, n = utf8.DecodeRune(b)
}

and we have several examples in the standard library itself:

All of these cases ultimate come down to the caller manually performing inlined logic for the ASCII case. We should make it such that DecodeRune and DecodeLastRune are inlineable for the ASCII case. Unfortunately my attempts to do so exceed the inline budget by ever so much:

// cannot inline DecodeRune: function too complex: cost 87 exceeds budget 80
func DecodeRune(p []byte) (r rune, size int) {
	if len(p) > 0 && p[0] < RuneSelf {
		return rune(p[0]), 1
	}
	return decodeRune(p)
}

// cannot inline DecodeLastRune: function too complex: cost 93 exceeds budget 80
func DecodeLastRune(p []byte) (r rune, size int) {
	if len(p) > 0 && p[len(p)-1] < UTFMax {
		return rune(p[len(p)-1]), 1
	}
	return decodeLastRune(p)
}

We should find a way to make these inlinable whether by special casing them, clever tricks to make the cost lower, by increasing the inline budget, etc.

\cc @CAFxX @josharian @mdempsky @martisch

dsnet avatar Sep 05 '21 01:09 dsnet

FWIW, I gave it a shot before: https://go-review.googlesource.com/c/go/+/332770

Unfortunately yes, I really can not see a way to make it work without messing with the inliner (either by changing the heuristics, or by adding special cases) or otherwise adding some kind of annotation.

CAFxX avatar Sep 05 '21 07:09 CAFxX

Seem to remember commenting on a similar issue, that couldn't do the same with binary.Uvarint().

#31666

But now with for loop inlining since added to go, Uvarint() can be made to (replace range use) inline as whole without needing a separate func.

renthraysk avatar Sep 05 '21 11:09 renthraysk

The inline cost of the following implementation is 83:

func DecodeRune(p []byte) (r rune, size int) {
	if len(p) > 0 && p[0] < RuneSelf {
		r, size = rune(p[0]), 1
	} else {
		r, size = decodeRune(p)
	}
	return
}

zigo101 avatar Sep 07 '21 10:09 zigo101

And this one is 81:

func DecodeRune(p []byte) (r rune, size int) {
	if len(p) > 0 && p[0] < RuneSelf {
		return rune(p[0]), 1
	}
	r, size = decodeRune(p)
	return
}

zigo101 avatar Sep 07 '21 12:09 zigo101

Maybe the inline cost calculation should consider whether or not bound checks are eliminated. Here, the bound checks for p[0] are eliminated, so its cost should be some lower.

BTW, why the inline costs of function calls are so large?

zigo101 avatar Sep 07 '21 14:09 zigo101

The ExtraCall budget is 57 plus two arguments we had at least 64 budget, maybe we can force inline by asm code. However I've tried modified the ExtraCall budget and it appears only improve ASCII related DecodeRune and regression on slow path.

https://perf.golang.org/search?q=upload:20210922.6

name                               old time/op  new time/op  delta
RuneCountTenASCIIChars             9.27ns ± 1%  9.27ns ± 1%     ~     (p=0.151 n=9+9)
RuneCountTenJapaneseChars          37.8ns ± 2%  38.2ns ± 1%   +1.09%  (p=0.022 n=10+8)
RuneCountInStringTenASCIIChars     9.26ns ± 0%  9.26ns ± 0%     ~     (p=0.333 n=10+10)
RuneCountInStringTenJapaneseChars  39.3ns ± 1%  38.1ns ± 2%   -3.07%  (p=0.000 n=10+10)
EncodeASCIIRune                    1.71ns ± 0%  1.71ns ± 0%     ~     (all equal)
EncodeJapaneseRune                 2.86ns ± 0%  2.83ns ± 0%   -1.12%  (p=0.000 n=10+9)
AppendASCIIRune                    0.26ns ± 0%  0.26ns ± 0%   -2.44%  (p=0.000 n=7+10)
AppendJapaneseRune                 3.09ns ± 0%  3.08ns ± 0%   -0.02%  (p=0.029 n=9+9)
DecodeASCIIRune                    2.06ns ± 0%  0.26ns ± 0%  -87.53%  (p=0.000 n=8+8)
DecodeJapaneseRune                 3.60ns ± 0%  3.73ns ± 0%   +3.52%  (p=0.000 n=10+8)
FullRune/ASCII                     1.03ns ± 0%  1.18ns ± 3%  +14.28%  (p=0.000 n=10+10)
FullRune/Incomplete                2.18ns ± 2%  2.14ns ± 1%   -2.00%  (p=0.000 n=10+10)
FullRune/Japanese                  1.03ns ± 0%  1.18ns ± 3%  +15.03%  (p=0.000 n=9+10)

mengzhuo avatar Sep 22 '21 13:09 mengzhuo

However I've tried modified the ExtraCall budget and it appears only improve ASCII related DecodeRune and regression on slow path.

Isn't that expected? With inlining, we generally want to get the common case to be inlined (i.e., ASCII), and perform the function call for the less common case (i.e., multibyte Unicode).

dsnet avatar Sep 22 '21 15:09 dsnet

Change https://golang.org/cl/354769 mentions this issue: cmd/compile: zero cost of byte to rune conversion

gopherbot avatar Oct 08 '21 14:10 gopherbot

Change https://golang.org/cl/354770 mentions this issue: unicode/utf8: make DecodeRune inlined

gopherbot avatar Oct 08 '21 14:10 gopherbot

Change https://golang.org/cl/356769 mentions this issue: unicode/utf8: default ascii fast path for DecodeRune

gopherbot avatar Oct 19 '21 07:10 gopherbot

Building on @TapirLiu's latest attempt, the following implementation of DecodeRuneInString is inlineable:

func DecodeRuneInString(s string) (r rune, size int) {
	// Inlineable fast path for ASCII characters.
	if s != "" && s[0] < RuneSelf {
		return rune(s[0]), 1
	} else {
		r, size = decodeRuneInString(s)
	}
	return
}
$ go build -gcflags '-m=2' ./unicode/utf8 2>&1 | grep 'can inline DecodeRuneInString'
unicode/utf8/utf8.go:213:6: can inline DecodeRuneInString with cost 80 as [...]

Sadly, there's no equivalent trick for DecodeRune. 🤷

jub0bs avatar Aug 26 '25 13:08 jub0bs

Happy days! After a lot of trial and error, I was able to get DecodeRune under the inlining budget:

func DecodeRune(p []byte) (r rune, size int) {
	for _, b := range p {
		if b < RuneSelf {
			return rune(b), 1
		}
		break
	}
	r, size = decodeRune(p)
	return
}
$ go build -gcflags '-m=2' ./unicode/utf8 2>&1 | grep 'can inline DecodeRune with'
unicode/utf8/utf8.go:157:6: can inline DecodeRune with cost 79 as: [...]

The implementation is so weird that it likely deserves a clarifying implementation comment (perhaps linking to this issue). I'll open a first PR to make both DecodeRune{,InString} inlineable; if/when it gets merged, I'll follow up with a PR to get rid of all instances of manual inlining of those functions in the standard library.

jub0bs avatar Aug 28 '25 07:08 jub0bs

Change https://go.dev/cl/699675 mentions this issue: unicode/utf8: make DecodeRune{,InString} inlineable

gopherbot avatar Aug 28 '25 11:08 gopherbot

@jub0bs Looks like your latest DecodeRune solution would work for DecodeRuneInString too.

func DecodeRuneInString(s string) (r rune, size int) {
	for _, b := range []byte(s) {
		if b < RuneSelf {
			return rune(b), 1
		}
		break
	}
	r, size = decodeRuneInString(s)
	return
}
./decoderune.go:23:6: can inline DecodeRuneInString with cost 80 as: func(string) (rune, int) { for loop; r, size = rune(.autotmp_4), .autotmp_5; return  }
./decoderune.go:35:21: s does not escape
./decoderune.go:23:25: s does not escape
./decoderune.go:24:27: ([]byte)(s) does not escape
./decoderune.go:24:27: zero-copy string->[]byte conversion

magical avatar Aug 28 '25 20:08 magical

@magical That works too. However, a simpler (less weird) implementation of DecodeRuneInString also results in an inlining cost of 80; see https://github.com/golang/go/issues/48195#issuecomment-3224273708.

Perhaps there is an argument to keep DecodeRuneInString's implementation in sync with DecodeRune's, or perhaps the simplest possible implementation is preferrable. We'll see what the reviewers say.

jub0bs avatar Aug 29 '25 02:08 jub0bs

FWIW, I think the CL should also remove the manual fast paths peppered around the standard library, and ensure we don't regress there either (my old CL, that is likely stale now, did this).

CAFxX avatar Aug 29 '25 03:08 CAFxX

@CAFxX Ok, I've removed all of the manual inlining that you addressed in your abandoned CL. I've also removed two instances (in bytes/iter.go and strings/iter.go) that had since crept in.

jub0bs avatar Aug 29 '25 18:08 jub0bs

Is there anything else to do here?

seankhliao avatar Dec 07 '25 17:12 seankhliao

OT: I recall there was talk about „global“ go version performance comparisons based on canonical set of benchmarks. Does this exist yet?

andig avatar Dec 07 '25 17:12 andig

I think it would have been nice to add a test to check that the fast path is (and will remain) inlined

CAFxX avatar Dec 07 '25 23:12 CAFxX

@seankhliao The ASCII path in DecodeLastRune{,InString} remains un-inlineable. The smallest inlining cost I managed to reach is 85; perhaps someone else can do better. I suspect only that elusive inliner overhaul will truly help.

jub0bs avatar Dec 08 '25 10:12 jub0bs

@CAFxX

add a test to check that the fast path is (and will remain) inlined

In addition to the inlineability test in https://go-review.googlesource.com/c/go/+/699675/8/src/cmd/compile/internal/test/inl_test.go?

jub0bs avatar Dec 08 '25 10:12 jub0bs