
experiment: Page-level compression of AccountsHistory

Giulio2002 opened this issue 2 months ago • 24 comments


Extending Commitment File Optimizations to Account History

Rationale

The accounts history snapshot files are the largest historical record in Erigon.
Like the commitment files, they consist of many fixed-size (32-byte) words that often contain repetitive or slowly changing data.
We can apply a similar approach—grouping N sequential words into compressed frames—to reduce disk footprint and I/O cost, while keeping random access efficient.


Concept

  • Split account history into frames of N sequential words (e.g., 4K words per frame).
  • Compress each frame (using zstd or lz4).
  • Maintain a sparse index to enable direct access to any range without full decompression.
  • Keep append-only semantics; compaction rewrites data into compressed form.

This balances compression ratio, read speed, and simplicity.
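
For illustration, a minimal sketch of this scheme in Go (hypothetical names and layout, not Erigon's actual code; zstd via github.com/klauspost/compress/zstd is just one possible codec): N words per frame, each frame compressed independently, and a sparse index with one entry per frame so a single word can be read by decompressing only its frame.

// Hypothetical sketch of frame-level compression with a sparse index.
package pagedcomp

import "github.com/klauspost/compress/zstd"

const (
	wordSize      = 32   // fixed-size words
	wordsPerFrame = 4096 // N sequential words per compressed frame
)

type compressedFile struct {
	frames     [][]byte // one zstd-compressed frame per group of N words
	firstWords []uint64 // sparse index: first word number of each frame
}

// build groups the words into frames and compresses each frame independently.
func build(words [][]byte) (*compressedFile, error) {
	enc, err := zstd.NewWriter(nil)
	if err != nil {
		return nil, err
	}
	defer enc.Close()

	f := &compressedFile{}
	for start := 0; start < len(words); start += wordsPerFrame {
		end := start + wordsPerFrame
		if end > len(words) {
			end = len(words)
		}
		raw := make([]byte, 0, (end-start)*wordSize)
		for _, w := range words[start:end] {
			raw = append(raw, w...)
		}
		f.frames = append(f.frames, enc.EncodeAll(raw, nil))
		f.firstWords = append(f.firstWords, uint64(start)) // one index entry per frame, not per word
	}
	return f, nil
}

// wordAt reads word i by decompressing only the frame that contains it.
func (f *compressedFile) wordAt(i int, dec *zstd.Decoder) ([]byte, error) {
	raw, err := dec.DecodeAll(f.frames[i/wordsPerFrame], nil)
	if err != nil {
		return nil, err
	}
	off := (i % wordsPerFrame) * wordSize
	return raw[off : off+wordSize], nil
}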


Epic-Reviewer: @wmitsuda

EDIT: we have repeated values in history, so removing them will be part of the same issue.

Giulio2002 avatar Oct 07 '25 11:10 Giulio2002

VBulikov added Imp2 as Importance

VBulikov avatar Oct 17 '25 11:10 VBulikov

I'm not sure if this is still correct: the .vi files can be made 1/N of their current size (N is the page size). Currently we add all keys to recsplit, but we don't need to; we just need to store the ones at the start of each page. We have efficient means to search within the page as well.

sudeepdino008 avatar Nov 05 '25 09:11 sudeepdino008

We are already using the class db/seg/seg_paged_rw.go:Page, which has this format:

metadata
keys_list
values_list

So keys are already co-located, which is good for linear/binary search.

The Page class isn't really aligned to the OS page size: it's measured not in KB but in the number of keys. For example, a page stores 16 key/values, no matter how many OS pages that takes. But because the keys are co-located, a linear/binary search will typically touch 1 OS page.
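
For illustration, a rough sketch of such a page (hypothetical types, not the actual seg.Page implementation): keys sorted and stored together, values after them, lookup by binary search within the page.

package pagesketch

import (
	"bytes"
	"sort"
)

// page holds a fixed number of key/value pairs (e.g. 16), regardless of how
// many bytes, and therefore OS pages, they occupy.
type page struct {
	keys   [][]byte // sorted and co-located, so a search stays in ~1 OS page
	values [][]byte // values section follows the keys section
}

// get does a binary search over the co-located keys of a single page.
func (p *page) get(key []byte) []byte {
	i := sort.Search(len(p.keys), func(i int) bool {
		return bytes.Compare(p.keys[i], key) >= 0
	})
	if i < len(p.keys) && bytes.Equal(p.keys[i], key) {
		return p.values[i]
	}
	return nil
}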

AskAlexSharov avatar Nov 06 '25 11:11 AskAlexSharov

currently we add to recsplit for all keys, we don't need to do that. We just need to store the ones at the start of the page

Unfortunately we can't do that, because recsplit supports only a Get(key) operation, not Get(prefix). If we add only 1 key from a page to recsplit, there is no way to tell which page the other 15 keys are on.

The .bt index can do it (I was able to implement it in my prototype). So page compression of .kv can reduce the .bt size as well (but maybe it won't, because .bt does a binary search on its LeafPage, which means it will look up around several "Pages" addressed by 1 .bt LeafPage).

AskAlexSharov avatar Nov 06 '25 11:11 AskAlexSharov

Because recsplit supports only a Get(key) operation, not Get(prefix)

Ah yes. I remember having done this in forkables, where it was possible because the key was a uint64... so you could just store the offset in recsplit for (key/16)*16.

sudeepdino008 avatar Nov 06 '25 12:11 sudeepdino008

so you could just store offset in recsplit for (key/16)*16

Yes, RCacheDomain is already stored this way in recsplit now: all 16 keys -> offsetOfPageStart.
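
A tiny sketch of that trick, with a plain map standing in for recsplit (illustrative only):

package pagesketch

// pageOffset sketches the (key/16)*16 trick for uint64 keys: every key in a
// page maps to the same page-start key, so the index needs one entry per
// page instead of one entry per key.
func pageOffset(index map[uint64]uint64, key uint64) (uint64, bool) {
	const pageLen = 16
	pageStart := (key / pageLen) * pageLen
	off, ok := index[pageStart]
	return off, ok
}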

AskAlexSharov avatar Nov 06 '25 13:11 AskAlexSharov

So, I experimented with enabling page-level compression using different parameters for the accounts history .v files (I used Gnosis mainnet for this). Depending on the page size, the compression ratio for the accounts history varies:

  • page == 16 → about compression
  • page == 32 → about compression
  • page == 64 → about compression
  • page == 128 → about compression

The relationship seems roughly logarithmic — which is probably expected. There are actually two benchmarks to consider here:

  1. The overall size after compression, and
  2. The speed of random access to a specific value.

However, my current access test is somewhat artificial — since, knowing the .vi file, we can jump to a page with a single disk seek using the known offset. The page decompression in memory itself is relatively fast — so in essence, my benchmark measures the speed of GetFromPage, which might not be sufficient.
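
For reference, a micro-benchmark of that shape could look roughly like this (a sketch in a _test.go file; the page layout and the choice of snappy are assumptions for illustration, not the actual GetFromPage API). It times only the in-memory part, decompressing one page and picking one record, and deliberately excludes the disk seek.

package pagesketch

import (
	"testing"

	"github.com/golang/snappy"
)

func BenchmarkPageLookup(b *testing.B) {
	// fake page: 16 records of 1-byte key + 32-byte value
	raw := make([]byte, 16*33)
	for i := 0; i < 16; i++ {
		raw[i*33] = byte(i)
	}
	compressed := snappy.Encode(nil, raw)

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		out, err := snappy.Decode(nil, compressed)
		if err != nil {
			b.Fatal(err)
		}
		_ = out[5*33 : 6*33] // extract one record from the decompressed page
	}
}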

I compressed the .v files through a migration process — since knowing .ef + .vi + .v, it’s possible to correctly rebuild both .v and .vi files. I also have a small observation: if you look at what’s actually stored in the account history snapshots, you’ll often see repeating triplets like these:

acc: fff8859f1d982ce5b89abc86076876a3db659c79
txn: 402429537
val: 010100209d4059abf46db50e4b48d9b6404b1f200cb5a1efeb3725be2de630a2a90019440101

acc: fff8859f1d982ce5b89abc86076876a3db659c79
txn: 402473421
val: 010100209d4059abf46db50e4b48d9b6404b1f200cb5a1efeb3725be2de630a2a90019440101

acc: fff8859f1d982ce5b89abc86076876a3db659c79
txn: 402561162
val: 010100209d4059abf46db50e4b48d9b6404b1f200cb5a1efeb3725be2de630a2a90019440101

acc: fff8859f1d982ce5b89abc86076876a3db659c79
txn: 402908347
val: 010100209d4059abf46db50e4b48d9b6404b1f200cb5a1efeb3725be2de630a2a90019440101

So, the value for this account didn’t actually change across these transaction points.

In essence — for history (we conceptually store intervals) — this interval between 402429537 and 402908347 should really collapse into one. (And such sequences of repeated values can actually be quite long.)

That means we’re currently storing more data than necessary, and the size of all files (not just .v, but also .ef, .efi, and .vi) could be reduced by removing these redundant points.
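
A minimal sketch of that deduplication (illustrative record type; a real implementation would have to do this while rebuilding the history files): within one account's history, a point is dropped when its value equals the previous point's value, so only the txn numbers where the value actually changed are kept.

package dedupsketch

import "bytes"

// historyEntry is an illustrative record, not Erigon's on-disk format.
type historyEntry struct {
	account []byte
	txn     uint64
	value   []byte
}

// dedupHistory assumes entries are sorted by (account, txn).
func dedupHistory(entries []historyEntry) []historyEntry {
	out := entries[:0]
	for i, e := range entries {
		if i > 0 &&
			bytes.Equal(e.account, entries[i-1].account) &&
			bytes.Equal(e.value, entries[i-1].value) {
			continue // redundant: value unchanged since the previous point
		}
		out = append(out, e)
	}
	return out
}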

For comparison — even without page-level compression, if I deduplicate such repeated values, the .v file size drops by about . With deduplication and page-level compression, the compression ratio improves further:

  • page == 16 → about compression
  • page == 32 → about compression
  • page == 64 → about compression
  • page == 128 → about compression

eastorski avatar Nov 10 '25 23:11 eastorski

Let's move the discussion of the tasks from private to #core or to the epic on GitHub. (Please copy and paste your comment there.)

  1. For the bench, try drop_caches + eth_getLogs for the block range (a minimal sketch of such a run is below this list).

  2. If you see duplicate values in the history, it might be a history bug (or you're reading it incorrectly; show what methods/code you're using). Look at the HistoryRange method, not GetAsOf, for this.

  3. Yes, if there are duplicates in the history, we'll kill them.

  4. What if I call "zstd acc.v" in the console?
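
A minimal sketch of the bench from item 1 (assumed localhost endpoint and an arbitrary block range; drop the OS page cache first, e.g. echo 3 > /proc/sys/vm/drop_caches as root, so the run measures cold reads):

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// standard JSON-RPC eth_getLogs request over a block range
	req := []byte(`{"jsonrpc":"2.0","id":1,"method":"eth_getLogs",` +
		`"params":[{"fromBlock":"0x100000","toBlock":"0x100400"}]}`)

	start := time.Now()
	resp, err := http.Post("http://localhost:8545", "application/json", bytes.NewReader(req))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("eth_getLogs: %d bytes in %s\n", len(body), time.Since(start))
}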

AskAlexSharov avatar Nov 11 '25 01:11 AskAlexSharov

I did not use HistoryRange to print the duplications. I tried it, but soon realised that the use-case of this method is not what I need; I explained it in this PR https://github.com/erigontech/erigon/pull/17864, which also contains the code showing how I read the history snapshots.

zstd works well; it lets me compress the accounts history files by roughly 10-12x.

eastorski avatar Nov 12 '25 00:11 eastorski

  1. If you see duplicate values in the history, it might be a history bug (or you're reading it incorrectly; show what methods/code you're using). Look at the HistoryRange method, not GetAsOf, for this.

When I analyzed the history of accounts months ago, I figured the main case is: very popular contract addresses are touched millions of times, but nothing in the account itself changes, only its storage. And since we don't store the storage root in serialized accounts, it is the same [nonce, balance, codehash, incarnation] repeated millions of times in history.

wmitsuda avatar Nov 12 '25 08:11 wmitsuda

func (sd *SharedDomains) DomainPut(domain kv.Domain, roTx kv.TemporalTx, k, v []byte, txNum uint64, prevVal []byte, prevStep kv.Step) error {
	if v == nil {
		return fmt.Errorf("DomainPut: %s, trying to put nil value. not allowed", domain)
	}
	ks := string(k)
	sd.sdCtx.TouchKey(domain, ks, v)

	if prevVal == nil {
		var err error
		prevVal, prevStep, err = sd.GetLatest(domain, roTx, k)
		if err != nil {
			return err
		}
	}
	switch domain {
	case kv.CodeDomain:
		if bytes.Equal(prevVal, v) {
			return nil
		}
	case kv.StorageDomain, kv.AccountsDomain, kv.CommitmentDomain, kv.RCacheDomain:
		//noop
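		// unlike the default case below, equal prevVal and v are not skipped here,
		// so unchanged account values still get written into history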
	default:
		if bytes.Equal(prevVal, v) {
			return nil
		}
	}
	return sd.mem.DomainPut(domain, ks, v, txNum, prevVal, prevStep)
}

AskAlexSharov avatar Nov 12 '25 09:11 AskAlexSharov

Set this variable when calling eth_getLogs: rpcDisableRCache = dbg.EnvBool("RPC_DISABLE_RCACHE", false), i.e. set the RPC_DISABLE_RCACHE environment variable.

AskAlexSharov avatar Nov 12 '25 09:11 AskAlexSharov

@eastorski RPC performance impact?

AskAlexSharov avatar Nov 24 '25 05:11 AskAlexSharov

Giulio2002 changed Start date from 17/10/2025 to 24/11/2025

VBulikov avatar Nov 25 '25 08:11 VBulikov

Giulio2002 changed Due date from 31/10/2025 to 08/12/2025

VBulikov avatar Nov 25 '25 08:11 VBulikov

Giulio2002 changed Start date from 24/11/2025 to 23/11/2025

VBulikov avatar Nov 25 '25 08:11 VBulikov

Giulio2002 changed Due date from 08/12/2025 to 07/12/2025

VBulikov avatar Nov 25 '25 08:11 VBulikov

Giulio2002 changed Due date from 07/12/2025 to 15/12/2025

VBulikov avatar Nov 25 '25 08:11 VBulikov

Giulio2002 changed Due date from 15/12/2025 to 17/12/2025

VBulikov avatar Nov 25 '25 08:11 VBulikov

Giulio2002 changed Due date from 17/12/2025 to 19/12/2025

VBulikov avatar Nov 25 '25 08:11 VBulikov

AskAlexSharov changed Due date from 19/12/2025 to 31/12/2025

VBulikov avatar Nov 25 '25 08:11 VBulikov

AskAlexSharov changed Due date from 31/12/2025 to 13/01/2026

VBulikov avatar Nov 25 '25 08:11 VBulikov

AskAlexSharov changed Start date from 23/11/2025 to 16/10/2025

VBulikov avatar Nov 25 '25 08:11 VBulikov

@eastorski

  • What is the speed difference between decompressing 1 account from page-level compression, word-level compression, and no compression?
  • What is the eth_getBalance bottleneck in your bench (please show a pprof picture)? If JSON or HTTP compression is the bottleneck, then maybe we need to disable HTTP compression, or test the speed of eth_getLogs instead.
  • What is the throughput change for some RPC method (if we throw many identical RPC requests at it)?

AskAlexSharov avatar Dec 16 '25 02:12 AskAlexSharov

@eastorski

  • What is the speed difference between decompressing 1 account from page-level compression, word-level compression, and no compression?
  • What is the eth_getBalance bottleneck in your bench (please show a pprof picture)? If JSON or HTTP compression is the bottleneck, then maybe we need to disable HTTP compression, or test the speed of eth_getLogs instead.
  • What is the throughput change for some RPC method (if we throw many identical RPC requests at it)?

I ran profiling for the case with and without compression, using Sepolia as an example (screenshots below).

I tested it using parallel request sending for eth_getBalance. In the end, I didn’t observe any changes in throughput. I tried different configurations with GOMAXPROCS set to 1, 2, and 4.

GOMAXPROCS directly affects throughput and scales it:

Processed accounts: 828228 took 115484ms perTx 0.14ms // uncompressed, GOMAXPROCS 1
Processed accounts: 809084 took 69015ms  perTx 0.09ms // uncompressed, GOMAXPROCS 2
Processed accounts: 833427 took 57416ms  perTx 0.07ms // uncompressed, GOMAXPROCS 4

Processed accounts: 843945 took 115398ms perTx 0.14ms // compressed, GOMAXPROCS 1
Processed accounts: 855608 took 71533ms  perTx 0.08ms // compressed, GOMAXPROCS 2
Processed accounts: 844168 took 60035ms  perTx 0.07ms // compressed, GOMAXPROCS 4

However, page-level compression does not have an impact. I think this is because from the algorithm’s point of view, in both cases it’s a single FS jump (since we have an exact offset in the index). After that, it doesn’t really matter whether we read N bytes or 16N bytes (roughly estimated, ignoring the page compression ratio). There’s little difference, especially if the size of those 16 values does not significantly exceed the page size.

So, judging by these profiles, as I said before, CPU resources directly affect throughput. I also verified this using cpulimit, but cpulimit gives a somewhat misleading result because it simply limits execution time, rather than providing a real independent performance gain as with true parallelism.

I would enable page-level compression for historical account snapshots, because it provides real disk space savings without degrading random access performance or throughput.

Compressed GOMAXPROCS 1: Image

Compressed GOMAXPROCS 2: Image

Compressed GOMAXPROCS 4: Image

Uncompressed GOMAXPROCS 1: Image

Uncompressed GOMAXPROCS 2: Image

Uncompressed GOMAXPROCS 4: Image

eastorski avatar Dec 22 '25 22:12 eastorski

  1. --http.compression=false
  2. I don't see paged-reader on pprof pictures.

AskAlexSharov avatar Dec 23 '25 03:12 AskAlexSharov

  1. --http.compression=false
  2. I don't see paged-reader on pprof pictures.

It takes little time, as I can see from pprof; that's why it is not shown in the graph. If I run the pprof top command directly, I can find it:

(pprof) top GetFromPage -num 100 -cum -desc
Active filters:
   focus=GetFromPage
   ignore=num|desc
Showing nodes accounting for 0.15s, 0.22% of 67.68s total

         0     0%     0%      0.15s  0.22%  github.com/erigontech/erigon/db/seg.GetFromPage

This is where it is called from, inside the historySeekInFiles func:

	if compressedPageValuesCount > 1 {
		v, ht.snappyReadBuffer = seg.GetFromPage(historyKey, v, ht.snappyReadBuffer, true)
	}

The pprof graph without http compression:

Image

eastorski avatar Dec 23 '25 15:12 eastorski

Ok. Then let’s release it.

AskAlexSharov avatar Dec 24 '25 00:12 AskAlexSharov