
trie: optimize node encoding using in-place RLP encoder.

Open · qianbin opened this issue 1 year ago · 9 comments

According to benchmarks, the improvement is considerable.

benchstat master.txt drlp.txt 
goos: linux
goarch: amd64
pkg: github.com/ethereum/go-ethereum/trie
cpu: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
                  │ master.txt  │              drlp.txt               │
                  │   sec/op    │   sec/op     vs base                │
EncodeShortNode-8   82.46n ± 1%   44.81n ± 0%  -45.66% (p=0.000 n=10)
EncodeFullNode-8    342.3n ± 1%   264.0n ± 0%  -22.87% (p=0.000 n=10)
geomean             168.0n        108.8n       -35.26%

                  │ master.txt │              drlp.txt               │
                  │    B/op    │    B/op     vs base                 │
EncodeShortNode-8   48.00 ± 0%   48.00 ± 0%       ~ (p=1.000 n=10) ¹
EncodeFullNode-8    576.0 ± 0%   576.0 ± 0%       ~ (p=1.000 n=10) ¹
geomean             166.3        166.3       +0.00%
¹ all samples are equal

                  │ master.txt │              drlp.txt               │
                  │ allocs/op  │ allocs/op   vs base                 │
EncodeShortNode-8   1.000 ± 0%   1.000 ± 0%       ~ (p=1.000 n=10) ¹
EncodeFullNode-8    1.000 ± 0%   1.000 ± 0%       ~ (p=1.000 n=10) ¹
geomean             1.000        1.000       +0.00%
¹ all samples are equal

qianbin avatar Jan 01 '24 15:01 qianbin

I think this is a great improvement. However, since this is very core to geth, I don't know if we should import a library for it. I quickly went through the library you built, and it seems very small and easy to follow. Maybe we could think about moving this library into our rlp encoder in order to have that code under our control? Will tag for triage.

MariusVanDerWijden avatar Jan 02 '24 07:01 MariusVanDerWijden

@MariusVanDerWijden done

qianbin avatar Jan 02 '24 10:01 qianbin

$ benchstat trie_encode.master trie_encode.pr 
name               old time/op    new time/op    delta
EncodeShortNode-8     144ns ± 7%      85ns ±11%  -41.29%  (p=0.008 n=5+5)
EncodeFullNode-8      700ns ± 3%     571ns ±16%  -18.48%  (p=0.008 n=5+5)

name               old alloc/op   new alloc/op   delta
EncodeShortNode-8     48.0B ± 0%     48.0B ± 0%     ~     (all equal)
EncodeFullNode-8       576B ± 0%      576B ± 0%     ~     (all equal)

name               old allocs/op  new allocs/op  delta
EncodeShortNode-8      1.00 ± 0%      1.00 ± 0%     ~     (all equal)
EncodeFullNode-8       1.00 ± 0%      1.00 ± 0%     ~     (all equal)

holiman avatar Jan 08 '24 09:01 holiman

It's a neat idea to write lists in place and then fix up the header. It can be optimized a bit more: in both the zero case and the small case, the header is exactly one byte, so we can write a single placeholder byte up front. If we do that, we only have to move the content in the large case; in the two other cases we just fix up the header in place.

diff --git a/rlp/raw.go b/rlp/raw.go
index 53da9f0d46..e92146a749 100644
--- a/rlp/raw.go
+++ b/rlp/raw.go
@@ -315,23 +315,24 @@ func AppendString(buf, str []byte) []byte {
 	}
 }
 
+func StartList(buf []byte) ([]byte, int) {
+	offset := len(buf)
+	// append a 1-byte header, which will suffice if content size is < 56 bytes
+	return append(buf, 0xc0), offset
+}
+
 // EndList ends up list starting from offset and returns the extended buffer.
 // Content after offset is treated as list content.
 func EndList(buf []byte, offset int) []byte {
-	contentSize := len(buf) - offset
-	if contentSize == 0 {
-		buf = append(buf, 0xC0)
-	} else if contentSize < 56 {
-		// shift the content for room of list header
-		buf = append(buf[:offset+1], buf[offset:]...)
+	contentSize := len(buf) - offset - 1
+	if contentSize < 56 {
 		// write list header
 		buf[offset] = 0xC0 + byte(contentSize)
-	} else {
-		headerSize := intsize(uint64(contentSize)) + 1
-		// shift the content for room of list header
-		buf = append(buf[:offset+headerSize], buf[offset:]...)
-		// write list header
-		appendUintWithTag(buf[:offset], uint64(contentSize), 0xF7)
+		return buf
 	}
-	return buf
+	headerSize := intsize(uint64(contentSize))
+	// shift the content for room of list header
+	buf = append(buf[:offset+headerSize], buf[offset:]...)
+	// write list header
+	return appendUintWithTag(buf[:offset], uint64(contentSize), 0xF7)
 }
diff --git a/trie/node_enc.go b/trie/node_enc.go
index c2a32a5d98..c5ed785419 100644
--- a/trie/node_enc.go
+++ b/trie/node_enc.go
@@ -39,7 +39,9 @@ func (n *shortNode) encode(buf []byte) []byte {
 	if buf == nil {
 		buf = make([]byte, 0, len(n.Key)+40)
 	}
-	offset := len(buf)
+	var offset int
+	buf, offset = rlp.StartList(buf)
+
 	buf = rlp.AppendString(buf, n.Key)
 	if n.Val != nil {
 		buf = n.Val.encode(buf)

That final optimization doesn't make any substantial difference, though:

$ benchstat  trie_encode.pr  trie_encode.pr.2 
name               old time/op    new time/op    delta
EncodeShortNode-8    84.5ns ±11%    80.1ns ±11%   ~     (p=0.421 n=5+5)
EncodeFullNode-8      571ns ±16%     567ns ±16%   ~     (p=1.000 n=5+5)

holiman avatar Jan 08 '24 10:01 holiman

It would be nice to get some real benchmarks instead of the low-level node-encoding ones. For example, we have some nice ones for DeriveSha in core/types.

fjl avatar Jan 16 '24 13:01 fjl

If we do a round of benchmarks, we should probably apply @holiman's suggestion too.

karalabe avatar Jan 16 '24 13:01 karalabe

Master -> original PR

[user@work go-ethereum]$ benchstat bench_types.master bench_types.drlp.1
name                       old time/op    new time/op    delta
DeriveSha200/std_trie-8       830µs ± 2%     824µs ± 2%    ~     (p=0.310 n=5+5)
DeriveSha200/stack_trie-8     517µs ± 5%     482µs ± 8%    ~     (p=0.056 n=5+5)

name                       old alloc/op   new alloc/op   delta
DeriveSha200/std_trie-8       290kB ± 0%     289kB ± 0%  -0.25%  (p=0.008 n=5+5)
DeriveSha200/stack_trie-8    46.4kB ± 0%    44.4kB ± 0%  -4.37%  (p=0.008 n=5+5)

name                       old allocs/op  new allocs/op  delta
DeriveSha200/std_trie-8       2.92k ± 0%     2.91k ± 0%  -0.09%  (p=0.000 n=5+4)
DeriveSha200/stack_trie-8     1.26k ± 0%     1.25k ± 0%  -0.71%  (p=0.008 n=5+5)

Original PR -> last commit

$ benchstat bench_types.drlp.1 bench_types.drlp.3
name                       old time/op    new time/op    delta
DeriveSha200/std_trie-8       824µs ± 2%     836µs ± 1%    ~     (p=0.056 n=5+5)
DeriveSha200/stack_trie-8     482µs ± 8%     490µs ± 4%    ~     (p=0.690 n=5+5)

name                       old alloc/op   new alloc/op   delta
DeriveSha200/std_trie-8       289kB ± 0%     289kB ± 0%  -0.04%  (p=0.008 n=5+5)
DeriveSha200/stack_trie-8    44.4kB ± 0%    44.3kB ± 0%  -0.02%  (p=0.008 n=5+5)

name                       old allocs/op  new allocs/op  delta
DeriveSha200/std_trie-8       2.91k ± 0%     2.91k ± 0%    ~     (p=0.079 n=4+5)
DeriveSha200/stack_trie-8     1.25k ± 0%     1.25k ± 0%    ~     (all equal)

EDIT: The initial numbers posted here were made with a version of the code which contained a flaw. After fixing the flaw in f59bd61, the second optimization was hardly noticeable.

holiman avatar Jan 18 '24 11:01 holiman

So it seems there is no big improvement in real cases. But your benchmark run shows something interesting: maybe we could get a benefit from this if we were able to pre-allocate the output []byte?

fjl avatar Jan 18 '24 12:01 fjl

> So it seems there is no big improvement in real cases. But your benchmark run shows something interesting, maybe we could get a benefit with this if we were able to perform some pre-allocation of the output []byte?

That's as expected. The PR mainly provides somewhat faster RLP encoding methods; I don't think it helps trie hashing much.

qianbin avatar Jan 19 '24 13:01 qianbin