cardano-db-sync
Db sync should possibly avoid trying to fetch non-current pool metadata
OS: Ubuntu
Versions
db-sync version (`cardano-db-sync --version`): 13.6.0.4
PostgreSQL version: 17
Build/Install Method
Method used to build or install cardano-db-sync: downloaded binaries
Run Method
Method used to run cardano-db-sync (eg Nix/Docker/systemd/none): systemd
Problem Report
After analysing the records in the off_chain_pool_fetch_error table and being perplexed for a while by some metadata hash mismatch messages in there, a colleague pointed out that some of the records I was looking at were due to db-sync trying to fetch a previous (not most recent) version of the metadata using a now-outdated URL, hash, or both. After manually purging the off_chain_pool_fetch_error table on my db-sync instances, the errors seen earlier seem to have gone away. This is an example of a log message in question:
230668 | 3085 | 2024-12-31 00:58:47.28307 | 24111 | Hash mismatch when fetching metadata from https://public.bladepool.com/metadata.json. Expected "2738e2233800ab7f82bd2212a9a55f52d4851f9147f161684c63e6655bedb562" but got "d7c25ea70f63c45413d56c35a80293e7dd859233c43c25e1b0cad2738cdfc037". | 51
230652 | 1498 | 2024-12-30 12:04:28.439982 | 27276 | Hash mismatch when fetching metadata from https://raw.githubusercontent.com/Bmtxs/sp/master/na.json. Expected "48cbb69c4384c9847369e89fd693e637236afb174813e05b6464e1cf2aea037d" but got "1df6e0d2b80ba684fbcca263fde20cfe8b5aa7a30ce15ff1fd79a8df2c5840a7". | 49
In both cases the pmr_id column value refers to a pool update record that is not the most recent one, which in my case caused a bit of confusion. So this ticket is primarily to trigger consideration of whether, once a new pmr_id is established for a pool, the retries for previous pmr_ids can be cancelled and some table cleanup performed at that point (unless there is value in retaining the full retry history for previous iterations of the metadata).
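The cleanup policy suggested above can be sketched with a toy in-memory schema. Note this is only illustrative: the table and column names mimic the db-sync ones mentioned in this issue (pmr_id pointing at pool_metadata_ref.id), and using max(id) as a proxy for "most recent pool update" is an assumption; the real schema orders updates by registration transaction.

```python
import sqlite3

# Toy schema loosely mimicking the db-sync tables discussed in this issue.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE pool_metadata_ref (
    id INTEGER PRIMARY KEY, pool_id INTEGER, url TEXT, hash TEXT);
CREATE TABLE off_chain_pool_fetch_error (
    id INTEGER PRIMARY KEY, pool_id INTEGER, pmr_id INTEGER,
    fetch_error TEXT, retry_count INTEGER);
-- Pool 51 registered metadata twice; ref 3085 is superseded by ref 4000.
INSERT INTO pool_metadata_ref VALUES
    (3085, 51, 'https://example.com/old.json', 'aa'),
    (4000, 51, 'https://example.com/new.json', 'bb');
INSERT INTO off_chain_pool_fetch_error VALUES
    (1, 51, 3085, 'Hash mismatch ...', 24111),  -- stale: superseded pmr_id
    (2, 51, 4000, 'Connection timeout', 3);     -- current: keep retrying
""")

# The proposed policy: cancel retries (here, delete error rows) for any
# pmr_id that is no longer the pool's latest metadata reference.
db.execute("""
DELETE FROM off_chain_pool_fetch_error AS f
WHERE f.pmr_id <> (SELECT max(r.id) FROM pool_metadata_ref r
                   WHERE r.pool_id = f.pool_id)
""")
remaining = db.execute(
    "SELECT pmr_id FROM off_chain_pool_fetch_error").fetchall()
print(remaining)  # only the current pmr_id 4000 survives
```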
I'm also seeing metadata fetch attempts being made for a pool that retired in epoch 210 (i.e. back in 2020), possibly another small optimisation opportunity, unless this is all by design to keep the database representation of all pools as thorough as possible.
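The retired-pool observation can be sketched the same way. Again the table and column names (pool_retire, retiring_epoch) are assumptions loosely based on db-sync's schema, and a real implementation would need to account for a retired pool re-registering later:

```python
import sqlite3

# Toy schema: which fetch-error retries belong to long-retired pools?
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE pool_retire (pool_id INTEGER, retiring_epoch INTEGER);
CREATE TABLE off_chain_pool_fetch_error (
    id INTEGER PRIMARY KEY, pool_id INTEGER, fetch_error TEXT);
INSERT INTO pool_retire VALUES (49, 210);  -- retired back in 2020
INSERT INTO off_chain_pool_fetch_error VALUES
    (1, 49, 'Hash mismatch ...'),
    (2, 51, 'Connection timeout');
""")
current_epoch = 530  # hypothetical current epoch

# Fetch attempts that could be skipped because the pool already retired
# (ignoring, for simplicity, pools that re-register after retiring).
skippable = db.execute("""
SELECT e.id FROM off_chain_pool_fetch_error e
JOIN pool_retire r ON r.pool_id = e.pool_id
WHERE r.retiring_epoch < ?
""", (current_epoch,)).fetchall()
print(skippable)  # [(1,)]
```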
While the immediate task, as aptly put by @kderme, is 'to have a policy that stops fetching attempts when there is a newer pool update', a side question here is whether any thought has been given to adding controls for pool metadata refresh in db-sync itself:
- Ability to blacklist a pool
- Ability to manually refresh a given pool's metadata
I think for years the best practice around pool metadata operations for SPOs has been to update the contents (and thus the metadata hash) whenever the pool makes any metadata change; eg for CNTools we already add a nonce field to ensure users don't submit multiple update entries with the same URL/hash combination. For those who do not follow this practice, perhaps the above could be managed by adding a `status` column to the pool_metadata_ref table (allowing us to blacklist or refresh specific pools):
- If `status` is `success`, we already have an entry in off-chain pool data; do not re-attempt the fetch for that id in future
- If `status` is `failed`, there will be re-attempts for that id (using the current logic with sleeps) until a successful fetch or a new pool update entry
- If `status` is `skip`, db-sync will not attempt fetching the URL for this pool metadata reference ID
- If `status` is `blacklist` (manually set), same as `skip`, but this allows manual control (cannot be overridden) for notorious / bad-ops folks
- If `status` is `refresh`, db-sync will re-attempt fetching metadata, allowing a manual refresh for a given entry (eg if the next polling for this entry is too far out)
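The fetch decision implied by the five proposed status values can be summarised in a small sketch. The status names come from the proposal above; the function and its logic are only illustrative, not an actual db-sync API:

```python
# Hypothetical decision function for the proposed pool_metadata_ref.status
# column: given the current status, should db-sync attempt a fetch?
def should_fetch(status: str) -> bool:
    decisions = {
        "success": False,    # already in off-chain pool data: never re-fetch
        "failed": True,      # keep retrying (with sleeps) until success or
                             # a newer pool update supersedes this ref
        "skip": False,       # operator asked db-sync to ignore this ref
        "blacklist": False,  # like skip, but cannot be overridden
        "refresh": True,     # manual one-off refresh request
    }
    if status not in decisions:
        raise ValueError(f"unknown status: {status}")
    return decisions[status]

fetchable = [s for s in ("success", "failed", "skip", "blacklist", "refresh")
             if should_fetch(s)]
print(fetchable)  # ['failed', 'refresh']
```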
I like the idea of a status field. Currently, we look for corresponding rows in off_chain_pool_data and off_chain_pool_fetch_error, which I don't think will scale much farther.
Since the problem only appears while DBSync is still syncing, I think it's not a big issue.
Using a status field to separate the current pool update from previous ones would be quite useful. For the next major DBSync release we're trying to focus more on live data and separate them from the historic ones; eg this is similar to https://github.com/IntersectMBO/cardano-db-sync/issues/1798
Designing the state machine, across rollbacks and manual intervention, could be tricky.
Delisting pools is already supported via the delisted_pool table, but it may only affect the SMASH server.