go-ethereum
ethdb/pebble: add database backend using pebble
Uses Pebble patched to expose an API for getting the amount of heap memory allocated through CGo: https://github.com/jwasinger/pebble/tree/mem-stats.
Modifies the system/memory/used and system/memory/held gauges to include Pebble's allocation through CGo.
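As a rough illustration of what that gauge change means (the `manualAllocatedBytes` accessor below is a stand-in for whatever the patched Pebble branch exposes, not its real API):

```go
// Sketch: fold Pebble's CGo-side allocations into a "memory used" figure,
// since the Go runtime cannot see memory allocated through CGo.
package main

import (
	"fmt"
	"runtime"
)

// manualAllocatedBytes stands in for the accessor added in the patched Pebble
// branch; the real name and mechanism may differ.
func manualAllocatedBytes() uint64 { return 0 }

func usedMemoryBytes() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// Go heap in use plus Pebble's CGo allocations.
	return m.HeapAlloc + manualAllocatedBytes()
}

func main() {
	fmt.Println(usedMemoryBytes())
}
```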
Charts from a snap sync (LevelDB is orange, Pebble is green):




Pebble disk usage:
```
311.7G  /datadir/geth/geth/chaindata/ancient
503.6G  /datadir/geth/geth/chaindata
230.3M  /datadir/geth/geth/ethash
670.1M  /datadir/geth/geth/triecache
4.4M    /datadir/geth/geth/nodes
504.5G  /datadir/geth/geth
4.0K    /datadir/geth/keystore
504.5G  /datadir/geth
```
LevelDB disk usage:
```
693.2M  /datadir/geth/geth/triecache
230.3M  /datadir/geth/geth/ethash
311.7G  /datadir/geth/geth/chaindata/ancient
559.7G  /datadir/geth/geth/chaindata
3.6M    /datadir/geth/geth/nodes
560.6G  /datadir/geth/geth
4.0K    /datadir/geth/keystore
560.6G  /datadir/geth
```
Update: still waiting for upstream on https://github.com/cockroachdb/pebble/pull/1628.
Actually, they have responded: https://github.com/cockroachdb/pebble/pull/1628#pullrequestreview-1026664054.
@jwasinger would you mind rebasing this?
@holiman done.
Testing this on two bootnodes which have a very hard time syncing:
```
ansible-playbook playbook.yaml -t geth -l bootnode-azure-westus-001,bootnode-azure-koreasouth-001 -e "geth_image=holiman/geth-experimental:latest" -e "geth_datadir_wipe=partial" -e '{"geth_args_custom":["--backingdb=pebble"]}'
```
and for comparison:
```
ansible-playbook playbook.yaml -t geth -l bootnode-azure-brazilsouth-001,bootnode-azure-australiaeast-001 -e "geth_image=holiman/geth-experimental:latest" -e "geth_datadir_wipe=partial"
```
Metrics that do not appear to work:
geth.eth/db/chaindata/disk/read.meter
geth.eth/db/chaindata/compact/writedelay/counter.meter
Edit: here's why the disk read meter is always zero:
```go
if db.diskReadMeter != nil {
	db.diskReadMeter.Mark(0) // pebble doesn't track non-compaction reads
}
```
Is there any point in even having the metric around? Though I suppose we can keep it for a while, to stay on par with leveldb.
It's still very early, but it looks like on our 'weak' azure nodes, pebble is performing a lot better, since it's not being killed-by-compaction:

The Pebble Azure nodes are at ~26%:
bootnode-azure-westus-001 geth INFO [08-29|16:20:37.679] State sync in progress synced=26.53% state=51.08GiB accounts=49,463,[email protected] slots=193,320,[email protected] codes=206,[email protected] eta=8h46m27.524s
bootnode-azure-koreasouth-001 geth INFO [08-29|16:20:38.621] State sync in progress synced=26.15% state=50.79GiB accounts=48,789,[email protected] slots=192,700,[email protected] codes=203,[email protected] eta=8h56m8.768s
The non-Pebble nodes are at ~15%:
bootnode-azure-brazilsouth-001 geth INFO [08-29|16:20:37.959] State sync in progress synced=13.69% state=27.49GiB accounts=26,174,[email protected] slots=104,749,[email protected] codes=117,[email protected] eta=19h31m38.490s
bootnode-azure-australiaeast-001 geth INFO [08-29|16:20:46.154] State sync in progress synced=16.99% state=32.69GiB accounts=32,593,[email protected] slots=123,280,[email protected] codes=141,[email protected] eta=15h11m3.934s
The Pebble nodes finished the first phase a couple of hours earlier:
Aug 30 03:30:56 bootnode-azure-koreasouth-001 geth INFO [08-30|01:30:55.936] State sync in progress synced=100.00% state=197.38GiB accounts=186,306,[email protected] slots=760,255,[email protected] codes=651,[email protected] eta=-2m14.173s
Aug 30 04:48:43 bootnode-azure-westus-001 geth INFO [08-30|02:48:43.208] State sync in progress synced=100.00% state=197.33GiB accounts=186,328,[email protected] slots=760,016,[email protected] codes=651,[email protected] eta=-2m31.628s
The LevelDB nodes:
Aug 30 06:31:50 bootnode-azure-australiaeast-001 geth INFO [08-30|04:31:50.585] State sync in progress synced=100.00% state=197.47GiB accounts=186,182,[email protected] slots=760,605,[email protected] codes=652,[email protected] eta=-2m43.671s
Aug 30 08:38:23 bootnode-azure-brazilsouth-001 geth INFO [08-30|06:38:23.436] State sync in progress synced=100.00% state=197.51GiB accounts=186,552,[email protected] slots=760,745,[email protected] codes=652,[email protected] eta=-1m27.429s
Right, there's this too:
```
C:\Users\appveyor\go\pkg\mod\github.com\jwasinger\[email protected]\internal\batchskl\skl.go:310:18: maxNodesSize (untyped int constant 4294967295) overflows int
C:\Users\appveyor\go\pkg\mod\github.com\jwasinger\[email protected]\internal\batchskl\skl.go:320:16: cannot use maxNodesSize (untyped int constant 4294967295) as int value in assignment (overflows)
```
Which, afaict, would be fixed by https://github.com/cockroachdb/pebble/pull/1619. It has been open since April.
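For illustration, a minimal sketch of the failure mode (not Pebble's actual source): a constant equal to math.MaxUint32 can be assigned to an int on 64-bit targets, but overflows the 32-bit int used on 386/arm builds, which is what the appveyor errors above complain about.

```go
// Minimal sketch of the 32-bit failure mode; not Pebble's actual code.
package main

import "fmt"

// On 64-bit platforms int is 64 bits wide, so the assignment below is legal.
// On 386/arm, int is 32 bits and the compiler rejects it with
// "constant 4294967295 overflows int", matching the errors above.
const maxNodesSize = 1<<32 - 1 // 4294967295

func main() {
	var limit int = maxNodesSize
	fmt.Println(limit)
}
```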
Found a little issue: with the cache configured to 256K (--cache 262144), ./ethdb/pebble/pebble.go:161 (MemTableSize: cache * 1024 * 1024 / 4) results in:
geth[2101456]: Fatal: Failed to register the Ethereum service: MemTableSize (21 G) must be < 4.0 G
So MemTableSize should be capped to 4 GB max.
--cache 65536 => Sep 16 13:51:30 geth01-ethereum-mainnet-eu geth[2105780]: Fatal: Failed to register the Ethereum service: MemTableSize (8.0 G) must be < 4.0 G
--cache 32768 => Sep 16 13:52:20 geth01-ethereum-mainnet-eu geth[2106436]: Fatal: Failed to register the Ethereum service: MemTableSize (4.0 G) must be < 4.0 G
From https://github.com/cockroachdb/pebble/blob/master/options.go, MemTableSize is an int.
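A minimal sketch of the cap suggested above (the derivation from the cache budget and the constant are assumptions for illustration, not necessarily what the PR ends up doing):

```go
// Sketch: derive the memtable size from the cache budget (in MiB) but keep it
// strictly below Pebble's 4 GiB hard limit. Illustrative only.
package main

import "fmt"

const maxMemTableSize = 4 << 30 // Pebble's limit, imposed by uint32 arena offsets

func memTableSize(cacheMiB int64) int64 {
	size := cacheMiB * 1024 * 1024 / 4 // a quarter of the cache budget, in bytes
	if size >= maxMemTableSize {
		size = maxMemTableSize - 1 // must stay strictly below 4 GiB
	}
	return size
}

func main() {
	// A large --cache no longer produces "MemTableSize ... must be < 4.0 G".
	fmt.Println(memTableSize(262144))
}
```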
On an archive node, it doesn't seem to make sync faster, but at least the ugly long compaction times (some stretches going up to 11 days of continuous compaction) are gone.
2 archive nodes with standard geth, showing the stair-case effect:

One of those 2 nodes using the pebble branch:

A mix of those 2 with both axes visible:

On 2022-06-29, started 2 archive nodes with geth 1.10.1x, then 1.10.2x, then 1.11.0. On 2022-07-13, stopped one of those 2 and replaced it with the 1.11.0 ex_pebble branch, after wiping its storage.
@SLoeuillet thanks for the feedback and charts! ~~Unfortunately, the Y-axis got a bit cropped out, so I couldn't really figure out how the two charts compared.~~ Would love to see some more charts after a few more days of progress!
Ah, the max memtable size is not so much because the field is an int (int64), but rather because of
https://github.com/cockroachdb/pebble/blob/master/open.go#L38:
```go
// The max memtable size is limited by the uint32 offsets stored in
// internal/arenaskl.node, DeferredBatchOp, and flushableBatchEntry.
maxMemTableSize = 4 << 30 // 4 GB
```
With great pleasure, I can announce that my archive node running the standard LevelDB storage just finished syncing to HEAD: 2022-06-09 => 2022-09-21.
The Pebble-based 1.11.0 node started syncing on 2022-09-16 and is currently at 7,925,855 blocks.
I ran a successful snap sync with it again. It took a pretty long time on a very underprovisioned node (5.8GB usable RAM), but it finished after ~70 hours.
Triage discussion: I'll take this PR and try to separate the 64-bit and 32-bit builds, so that we avoid pebble when building for 32-bit.
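One way that split could look (the file layout and constraint set here are assumptions for illustration, not necessarily what the follow-up PR does) is a build-constrained file, so the Pebble backend is only compiled on 64-bit platforms:

```go
//go:build amd64 || arm64

// Sketch of a 64-bit-only gate for the Pebble backend; the real PR may use a
// different tag set or package layout.
package pebble

// Available reports whether this build includes the Pebble backend; a
// companion file with the inverse build constraint would set it to false.
const Available = true
```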
Closing in favour of https://github.com/ethereum/go-ethereum/pull/26517