cardano-db-sync
Blockfrost RYO specific improvements
Hi, I would like to propose additional improvements/additions to the db-sync project, which would help improve the speed of RYO in the future.
A separate --blockfrost switch (or more switches, to allow for granularity) could be introduced, as some of the data might not be needed in general and would take up a considerably large amount of disk space, since some of the improvements would require additional de-normalized tables. Other improvements include the pre-calculation of historic data, which is not live but takes a considerable amount of resources to calculate on the fly.
Running balances
A starting point is to introduce address and payment_cred tables containing each address just once (as opposed to tx_out), similar to the stake_address table, which could then be used in several ways, e.g.:
CREATE TABLE address (
  id BIGSERIAL PRIMARY KEY,
  address TEXT NOT NULL UNIQUE
);

CREATE TABLE payment_cred (
  id BIGSERIAL PRIMARY KEY,
  payment_cred TEXT NOT NULL UNIQUE
);
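To illustrate how these could then be used (a sketch only: the address_id column on tx_out is an assumption, not part of the current schema), per-address lookups become an id join instead of a scan over repeated text addresses:

-- Hypothetical: reference the deduplicated row from tx_out.
ALTER TABLE tx_out ADD COLUMN address_id BIGINT REFERENCES address (id);

-- Per-address output lookup via the deduplicated table.
SELECT o.*
FROM tx_out o
JOIN address a ON a.id = o.address_id
WHERE a.address = 'addr1...';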
- Introduce "running"
balance
(i.e. the current state) of anaddress
/payment_cred
andstake_address
(stake_address table would need additionalbalance
field). This would greatly improve the overall speed when fetching balance. There are issues with multi assets (foraddress
/payment_cred
) though, as those could bloat some addresses, especially those related to exchanges etc. so this would require further thought, but calculating ADA balance is doable right off the bat. Having utxos available sure helps, but I believe it would be even better to also keep the balances. - Ideally, have
balance_in
andbalance_out
fields as well, or use those instead of a singlebalance
field
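As a rough sketch of the running-balance idea (the column names here are my own; the lovelace domain is the one db-sync already uses):

-- Sketch only: a running ADA balance on the deduplicated address rows and on
-- stake_address; balance_in/balance_out could replace the single field.
ALTER TABLE address ADD COLUMN balance PUBLIC.LOVELACE NOT NULL DEFAULT 0;
ALTER TABLE stake_address ADD COLUMN balance PUBLIC.LOVELACE NOT NULL DEFAULT 0;

-- Fetching a current balance then becomes a point lookup:
SELECT balance FROM address WHERE address = 'addr1...';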
Live data
In addition, I would like to propose the following:
- Introduce new fields to the pool_hash table, or create a new table, with live_stake, active_stake, live_delegators, and historical blocks_minted information, which is expensive to calculate on the fly.
- Introduce an assets table, which would contain each asset with an id (appended/ordered by MIN(ma_tx_mint.id) to allow for pagination) and calculated quantity and mint_or_burn_count fields.
- Introduce a stats table containing the network supply (total, circulating, locked, treasury, reserves) and total stake stats (live and active); this info is needed, for example, to calculate pool saturation, live size, etc. A possible table shape follows this list.
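A possible shape for such a stats table (column names are derived from the list above; the exact types are assumptions):

CREATE TABLE stats (
  id BIGSERIAL PRIMARY KEY,
  supply_total PUBLIC.LOVELACE NOT NULL,
  supply_circulating PUBLIC.LOVELACE NOT NULL,
  supply_locked PUBLIC.LOVELACE NOT NULL,
  supply_treasury PUBLIC.LOVELACE NOT NULL,
  supply_reserves PUBLIC.LOVELACE NOT NULL,
  stake_live PUBLIC.LOVELACE NOT NULL,
  stake_active PUBLIC.LOVELACE NOT NULL
);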
Auxiliary data
Next, data which is not live, but still expensive to calculate on the fly:
- Introduce stake_amount (SUM) into the epoch table (helpful when calculating the historical live_size of a pool).
- Introduce a pool_history table, which would contain data grouped by epoch, e.g.:
CREATE TABLE pools_history (
  epoch_no BIGINT NOT NULL,
  pool_id BIGINT NOT NULL,
  blocks INTEGER NOT NULL,
  delegators INTEGER NOT NULL,
  active_size NUMERIC NOT NULL,
  active_stake PUBLIC.LOVELACE,
  rewards PUBLIC.LOVELACE NOT NULL,
  fees PUBLIC.LOVELACE NOT NULL
);
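To illustrate how the two pieces fit together (the stake_amount column is the one proposed above; the pool id is a placeholder), a pool's historical share of the epoch's total stake can then be computed directly:

-- Historical stake share of one pool, per epoch.
SELECT ph.epoch_no,
       ph.active_stake / NULLIF(e.stake_amount, 0) AS stake_share
FROM pools_history ph
JOIN epoch e ON e.no = ph.epoch_no
WHERE ph.pool_id = 42
ORDER BY ph.epoch_no;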
Denormalized transactions
Due to how pagination works in RYO, the transaction history of address, payment_cred and asset will require denormalized tables in order to speed up lookups, e.g.:
CREATE TABLE transactions_address (
  tx_id BIGINT NOT NULL,
  address_id BIGINT NOT NULL, -- payment_cred_id / asset_id in the other two tables
  hash TEXT NOT NULL,
  block_height WORD31TYPE,
  block_id BIGINT NOT NULL,
  tx_index WORD31TYPE NOT NULL,
  block_time INTEGER NOT NULL
);
with a requirement of tx_id being consecutive, i.e. without any gaps; otherwise the lookups won't be optimal. This means not relying on Postgres' serial, but maintaining the ids internally (or using row_number() OVER ... and always selecting the previous MAX).
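A hedged sketch of that row_number()/MAX approach (new_rows is a hypothetical staging relation holding the freshly synced entries):

-- Assign dense ids instead of relying on a serial, so they stay gapless even
-- after rollbacks have removed rows.
INSERT INTO transactions_address (tx_id, address_id, hash, block_height,
                                  block_id, tx_index, block_time)
SELECT COALESCE((SELECT MAX(tx_id) FROM transactions_address), 0)
         + ROW_NUMBER() OVER (ORDER BY n.block_id, n.tx_index),
       n.address_id, n.hash, n.block_height, n.block_id, n.tx_index, n.block_time
FROM new_rows n;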
Practically all of the aforementioned improvements have been tested one way or another (in the form of triggers or materialized views), so in case further SQL implementations/examples are needed, I am more than happy to share them with the team.
Note: this is a first attempt to address RYO improvements directly in db-sync, so it will definitely get amended and/or edited.
THANKS!
Some initial thoughts on this:
address and payment_cred tables
We've done some work on this. It slowed down syncing speed, so we decided to park it. It could become an optional feature though, or the default if we neutralise its effect with separate performance improvements.
Introduce "running" balance Live data
A challenge with these cases is that each tx triggers an update instead of an insertion. When there are updates, handling rollbacks is never easy.
Auxiliary data
DBSync should be able to extract this from the epoch boundary ledger state.
Denormalized transactions
I don't really understand this. Would this have an entry for each tx and address?
separate --blockfrost switch
Wut? 😄 Hope if this is prioritized, the nomenclature continues to be based on utility, not companies. There are multiple API providers as well as projects that have requested parts of this over months (besides other featuresets) 🙂. Sorry for nitpicking this - but just want to point it out, now that it is part of the IO umbrella.
It slowed down syncing speed, so we decided to park it
@kderme - Given the recent config additions, this could maybe be made a flag - as discussed in #1333, the advantage from a query PoV (+ resources, particularly storage and IOPS consumption) is abundantly higher than the initial one-time sync cost.
Wut? 😄 Hope if this is prioritized, the nomenclature continues to be based on utility, not companies. There are multiple API providers as well as projects that have requested parts of this over months (besides other featuresets) 🙂. Sorry for nitpicking this - but just want to point it out, now that it is part of the IO umbrella.
Fear not, our goal is to improve usability/functionality/performance for the whole community who use cardano-db-sync, no favouritism 😅
In all honesty, we really want to hear back from the community about their real-life use cases so we can better steer what is and isn't needed. It also gives us, as a team, a chance to better explain the constraints we always have to balance when trying to provide new features, fix existing issues, and keep up with new standards, etc.
We have started a public Matrix channel (which we'll publicly advertise ASAP); we hope it will be a great place for informal discussions/questions and will open the communication channels up more. Then, if things of importance arise on Matrix, we can move them to GitHub issues with agreed specs and such. 😄
Some initial thoughts on this:
address and payment_cred tables
We've done some work on this. It slowed down syncing speed, so we decided to park it. It could become an optional feature though, or the default if we neutralise its effect with separate performance improvements.
Optional sounds reasonable.
Introduce "running" balance Live data
A challenge with these cases is that each tx triggers an update instead of an insertion. When there are updates, handling rollbacks is never easy.
Yes, I understand that. If it helps, I believe I have a working and battle-tested trigger solution (inserts/upserts/rollbacks,...), although just for ADA balances.
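For reference, a heavily simplified sketch of the kind of trigger meant here (it assumes the address.balance and tx_out.address_id columns from the earlier sketches and only covers outputs; a real solution also has to account for spending via tx_in):

-- Keep the running ADA balance in sync with tx_out rows; the DELETE branch is
-- what makes db-sync rollbacks (which delete rows) work.
CREATE OR REPLACE FUNCTION update_address_balance() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    UPDATE address SET balance = balance + NEW.value WHERE id = NEW.address_id;
    RETURN NEW;
  ELSIF TG_OP = 'DELETE' THEN
    UPDATE address SET balance = balance - OLD.value WHERE id = OLD.address_id;
    RETURN OLD;
  END IF;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tx_out_balance
  AFTER INSERT OR DELETE ON tx_out
  FOR EACH ROW EXECUTE FUNCTION update_address_balance();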
Auxiliary data
DBSync should be able to extract this from the epoch boundary ledger state.
Great to hear that!
Denormalized transactions
I don't really understand this. Would this have an entry for each tx and address?
Basically yes. The way we are currently retrieving txs is per address. Since some addresses can be very large and we are using offset pagination instead of cursor-based pagination (for multiple reasons), the data can be very sparse. We need it dense, and ideally we want to know exactly which page lands where, so we can retrieve the data with constant complexity.
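To make the "constant complexity" point concrete (addr_seq below is a hypothetical dense per-address counter, not part of the table sketched above): with dense per-address numbering, a given page maps directly to an index-friendly range instead of an OFFSET scan:

-- Page 5 with a page size of 100: rows 401-500 for this address.
SELECT hash, block_height, tx_index, block_time
FROM transactions_address
WHERE address_id = $1
  AND addr_seq BETWEEN 401 AND 500
ORDER BY addr_seq;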
FYI DBSync now has this table https://github.com/IntersectMBO/cardano-db-sync/blob/master/doc/schema.md#pool_stat
It's not populated by default; it is only populated when "pool_stat": enabled is used in the db-sync insert config.