go-ethereum
trie: optimize node encoding using in-place RLP encoder.
According to benchmarks, the improvement is considerable.
benchstat master.txt drlp.txt
goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/trie
cpu: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
                  │ master.txt  │            drlp.txt            │
                  │   sec/op    │   sec/op     vs base           │
EncodeShortNode-8   82.46n ± 1%   44.81n ± 0%  -45.66% (p=0.000 n=10)
EncodeFullNode-8    342.3n ± 1%   264.0n ± 0%  -22.87% (p=0.000 n=10)
geomean             168.0n        108.8n       -35.26%

                  │ master.txt │           drlp.txt            │
                  │    B/op    │    B/op     vs base           │
EncodeShortNode-8   48.00 ± 0%   48.00 ± 0%  ~ (p=1.000 n=10) ¹
EncodeFullNode-8    576.0 ± 0%   576.0 ± 0%  ~ (p=1.000 n=10) ¹
geomean             166.3        166.3       +0.00%
¹ all samples are equal

                  │ master.txt │           drlp.txt            │
                  │ allocs/op  │ allocs/op   vs base           │
EncodeShortNode-8   1.000 ± 0%   1.000 ± 0%  ~ (p=1.000 n=10) ¹
EncodeFullNode-8    1.000 ± 0%   1.000 ± 0%  ~ (p=1.000 n=10) ¹
geomean             1.000        1.000       +0.00%
¹ all samples are equal
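For readers unfamiliar with the approach, here is a minimal self-contained sketch of the in-place list encoding the PR title refers to: append the list content first, then retrofit the header, so no intermediate buffer is needed. This is illustrative only, not the PR's code; appendString handles only short strings and endList only short lists.

package main

import "fmt"

// appendString appends the RLP encoding of a short (<56 byte) string.
func appendString(buf, s []byte) []byte {
	if len(s) == 1 && s[0] < 0x80 {
		return append(buf, s[0]) // a single low byte encodes as itself
	}
	buf = append(buf, 0x80+byte(len(s)))
	return append(buf, s...)
}

// endList shifts the content written after offset right by one byte and
// writes a short-list header into the gap. Long lists (>=56 bytes of
// content) would need a multi-byte header and are omitted in this sketch.
func endList(buf []byte, offset int) []byte {
	contentSize := len(buf) - offset
	if contentSize == 0 {
		return append(buf, 0xC0) // empty list
	}
	if contentSize >= 56 {
		panic("long lists not handled in this sketch")
	}
	buf = append(buf[:offset+1], buf[offset:]...) // make room for the header
	buf[offset] = 0xC0 + byte(contentSize)
	return buf
}

func main() {
	var buf []byte
	offset := len(buf)
	buf = appendString(buf, []byte("cat"))
	buf = appendString(buf, []byte("dog"))
	buf = endList(buf, offset)
	fmt.Printf("%x\n", buf) // prints c88363617483646f67, i.e. ["cat", "dog"]
}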
I think this is a great improvement. However, since this is very core to geth, I don't know if we should import a library for it. I quickly went through the library you built, and it seems very small and simple. Maybe we could think about moving this library into our rlp encoder, so that we keep that code under our control? Will tag for triage.
@MariusVanDerWijden done
$ benchstat trie_encode.master trie_encode.pr
name               old time/op  new time/op  delta
EncodeShortNode-8   144ns ± 7%    85ns ±11%  -41.29%  (p=0.008 n=5+5)
EncodeFullNode-8    700ns ± 3%   571ns ±16%  -18.48%  (p=0.008 n=5+5)

name               old alloc/op  new alloc/op  delta
EncodeShortNode-8    48.0B ± 0%    48.0B ± 0%  ~  (all equal)
EncodeFullNode-8      576B ± 0%     576B ± 0%  ~  (all equal)

name               old allocs/op  new allocs/op  delta
EncodeShortNode-8     1.00 ± 0%      1.00 ± 0%  ~  (all equal)
EncodeFullNode-8      1.00 ± 0%      1.00 ± 0%  ~  (all equal)
It's a neat idea to write lists in-place and then fix up the header. It can be optimized a bit more -- in both the "zero-case" and the "small-case", the header is exactly one byte, so we can just reserve that one byte up front. If we do that, we only have to move the content in the "large-case"; in the two other cases we just fix up the header.
diff --git a/rlp/raw.go b/rlp/raw.go
index 53da9f0d46..e92146a749 100644
--- a/rlp/raw.go
+++ b/rlp/raw.go
@@ -315,23 +315,24 @@ func AppendString(buf, str []byte) []byte {
}
}
+func StartList(buf []byte) ([]byte, int) {
+ offset := len(buf)
+ // append a 1-byte header, which will suffice if content size is < 56 bytes
+ return append(buf, 0xc0), offset
+}
+
// EndList ends the list that starts at offset and returns the extended buffer.
// Content after offset is treated as list content.
func EndList(buf []byte, offset int) []byte {
- contentSize := len(buf) - offset
- if contentSize == 0 {
- buf = append(buf, 0xC0)
- } else if contentSize < 56 {
- // shift the content for room of list header
- buf = append(buf[:offset+1], buf[offset:]...)
+ contentSize := len(buf) - offset - 1
+ if contentSize < 56 {
// write list header
buf[offset] = 0xC0 + byte(contentSize)
- } else {
- headerSize := intsize(uint64(contentSize)) + 1
- // shift the content for room of list header
- buf = append(buf[:offset+headerSize], buf[offset:]...)
- // write list header
- appendUintWithTag(buf[:offset], uint64(contentSize), 0xF7)
+ return buf
}
- return buf
+ headerSize := intsize(uint64(contentSize))
+ // shift the content for room of list header
+ buf = append(buf[:offset+headerSize], buf[offset:]...)
+ // write list header
+ return appendUintWithTag(buf[:offset], uint64(contentSize), 0xF7)
}
diff --git a/trie/node_enc.go b/trie/node_enc.go
index c2a32a5d98..c5ed785419 100644
--- a/trie/node_enc.go
+++ b/trie/node_enc.go
@@ -39,7 +39,9 @@ func (n *shortNode) encode(buf []byte) []byte {
if buf == nil {
buf = make([]byte, 0, len(n.Key)+40)
}
- offset := len(buf)
+ var offset int
+ buf, offset = rlp.StartList(buf)
+
buf = rlp.AppendString(buf, n.Key)
if n.Val != nil {
buf = n.Val.encode(buf)
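For reference, a sketch of how the StartList/EndList pair from the diff is meant to bracket a list; the matching EndList call in shortNode.encode falls outside the hunk above, and encodePair is a hypothetical helper, not code from the PR:

func encodePair(buf, key, val []byte) []byte {
	var offset int
	buf, offset = rlp.StartList(buf) // reserves the 1-byte header
	buf = rlp.AppendString(buf, key)
	buf = rlp.AppendString(buf, val)
	return rlp.EndList(buf, offset) // fixes up the header in place
}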
That final optimization doesn't make any substantial difference, though:
$ benchstat trie_encode.pr trie_encode.pr.2
name               old time/op  new time/op  delta
EncodeShortNode-8  84.5ns ±11%  80.1ns ±11%  ~  (p=0.421 n=5+5)
EncodeFullNode-8    571ns ±16%   567ns ±16%  ~  (p=1.000 n=5+5)
It would be nice to get some real benchmarks instead of the low-level node encoding ones. For example, we have some nice ones for DeriveSha in core/types.
If we do a round of benchmarks, we should probably apply @holiman's suggestion too.
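For concreteness, such a run would look roughly like the following, modeled on the existing BenchmarkDeriveSha200 in core/types; the benchmark name and the dummy-transaction setup are assumptions, chosen only to give DeriveSha realistic input:

package types_test

import (
	"math/big"
	"testing"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
	"github.com/ethereum/go-ethereum/trie"
)

func BenchmarkDeriveShaStackTrie(b *testing.B) {
	// Build 200 dummy legacy transactions as the derivable list.
	txs := make(types.Transactions, 200)
	for i := range txs {
		txs[i] = types.NewTransaction(uint64(i), common.Address{}, big.NewInt(1), 21000, big.NewInt(1), nil)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		types.DeriveSha(txs, trie.NewStackTrie(nil))
	}
}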
Master -> original PR
[user@work go-ethereum]$ benchstat bench_types.master bench_types.drlp.1
name                       old time/op    new time/op    delta
DeriveSha200/std_trie-8      830µs ± 2%     824µs ± 2%   ~       (p=0.310 n=5+5)
DeriveSha200/stack_trie-8    517µs ± 5%     482µs ± 8%   ~       (p=0.056 n=5+5)

name                       old alloc/op   new alloc/op   delta
DeriveSha200/std_trie-8      290kB ± 0%     289kB ± 0%   -0.25%  (p=0.008 n=5+5)
DeriveSha200/stack_trie-8   46.4kB ± 0%    44.4kB ± 0%   -4.37%  (p=0.008 n=5+5)

name                       old allocs/op  new allocs/op  delta
DeriveSha200/std_trie-8      2.92k ± 0%     2.91k ± 0%   -0.09%  (p=0.000 n=5+4)
DeriveSha200/stack_trie-8    1.26k ± 0%     1.25k ± 0%   -0.71%  (p=0.008 n=5+5)
Original PR -> last commit
$ benchstat bench_types.drlp.1 bench_types.drlp.3
name                       old time/op    new time/op    delta
DeriveSha200/std_trie-8      824µs ± 2%     836µs ± 1%   ~       (p=0.056 n=5+5)
DeriveSha200/stack_trie-8    482µs ± 8%     490µs ± 4%   ~       (p=0.690 n=5+5)

name                       old alloc/op   new alloc/op   delta
DeriveSha200/std_trie-8      289kB ± 0%     289kB ± 0%   -0.04%  (p=0.008 n=5+5)
DeriveSha200/stack_trie-8   44.4kB ± 0%    44.3kB ± 0%   -0.02%  (p=0.008 n=5+5)

name                       old allocs/op  new allocs/op  delta
DeriveSha200/std_trie-8      2.91k ± 0%     2.91k ± 0%   ~       (p=0.079 n=4+5)
DeriveSha200/stack_trie-8    1.25k ± 0%     1.25k ± 0%   ~       (all equal)
EDIT: The initial numbers posted here were made with a version of the code which contained a flaw. After fixing the flaw in f59bd61, the second optimization was hardly noticeable.
So it seems there is no big improvement in real cases. But your benchmark run shows something interesting: maybe we could get a benefit from this if we were able to pre-allocate the output []byte?
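A sketch of what that pre-allocation could look like at a call site; nodes and process are hypothetical, and encode takes the destination buffer as in the trie/node_enc.go diff above:

// Reuse one output buffer across encode calls instead of letting
// append grow a fresh one each time.
buf := make([]byte, 0, 512) // rough guess at a typical node's encoded size
for _, n := range nodes {
	buf = n.encode(buf[:0]) // encode appends into buf, reusing its capacity
	process(buf)
}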
It's as expected. The PR mainly provides a few slightly faster RLP encoding methods; I don't think it helps trie hashing much.