graph-node [Feature] Automate setting of account-like optimization

The account-like optimization is turned on with graphman stats account-like <sgdNNN> <table> and changes query generation for that table by adding to the clause block_range @> $block that all our queries contain the redundant clauses lower(block_range <= $block and coalesce(upper(block_range), 2147483647) > $block. That makes the BRIN index on these expressions that we have on all tables usable by the query planner; it can bring about a dramatic improvement in query speed when it works, but unfortunately, it can also slow down queries a lot in some cases. That's why we don't use it by default.

The general rule of thumb is that if the ratio of entities to versions (count(distinct id)/count(*)) is very low, say < 0.1%, that the account-like optimization can be useful. Because of the BRIN index, it is very dependent on the exact layout of data on disk. In Uniswap-like subgraphs, it usually is very effective for tables like token and pair, since in those, versions that are visible at a certain block tend to cluster. It's usually not effective for tables like token_day_data or pair_hour_data where versions with an unlimited upper bound on their block range are interspersed with very old versions throughout the heap. It's also not effective for entities that are immutable, as they all have an unlimited upper bound on their block range - if subgraph authors declare these entities as immutable, this isn't an issue as we don't store a block range in that case, just a start block and can use BTree indexes to search. But for entities that are immutable but not declared as such it can become an issue.

It would be great if we could have a reliable indication of when the account-like optimization will be effective. For example, we could look at how selective the BRIN index is at higher block numbers and adapt query generation accordingly. It's not clear to me how best we would gauge the selectiveness of the BRIN index; we might have to resort to page_inspect or some such.

Apr 27 '23 15:04 lutter

If by selectivity you mean non-overlappiness (I mean, how much the data is in order, therefore the value ranges of different index blocks don't overlap), then pageinspect is indeed a good candidate (I don't have a better idea, actually).

Otherwise, as I'm lacking the history of development, have you ever considered to keep the lower and upper bounds separate, and index them using btree? Especially in the case when there is a lot of overlapping in the BRIN, this might help (depending on cardinality, of course). It would also make it possible to replace the gist indexes (as far as I see, they are not used for actual exclusion).

Aug 22 '23 14:08 avandras

Yes, I think experimenting with btree for the block ranges would be a good idea; I'd expect an index on lower(block_range) to not be very useful since almost all queries are close to the chain head, so almost all entries will match. But it's definitely something we should try out.

Aug 22 '23 17:08 lutter

Looks like this issue has been open for 6 months with no activity. Is it still relevant? If not, please remember to close it.

Feb 22 '24 00:02 github-actions[bot]