
drep_distr table data discrepancy between seemingly identical 13.6.0.5/10.1.4 setups

Open hodlonaut opened this issue 6 months ago • 7 comments

OS: Debian 12

Versions: cardano-db-sync 13.6.0.5 - linux-x86_64 - ghc-8.10, git revision cb61094c82254464fc9de777225e04d154d9c782

Build/Install Method: Provided continuous-integration binaries

Run method: systemd

Additional context: The discrepancy was first reported for the drep with id drep1yttcav7gh3xlkqd876gmgma32c7qj555ajnn3fnp9kl6l4g97vcyz, but when later comparing drep_distr values between a couple of similar setups (same node and db-sync versions), the earliest discrepancy I found was in epoch 538, involving the drep with id drep1ytuvz8hq2qmcdvy9rs4erpfhge5gypfut9yva9tjj2vw66g73ge5q. On the problematic instance that drep's voting power was reported as between 1k and 2k Ada, while on all the other instances I checked it was between 1 and 2 million Ada.

For epoch 538, the drep_distr amount for that drep is 1,093,348,347,589 (lovelace) in the 'good' db and 1,476,287,731 in the 'problematic' one.
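For reference, a query along these lines should pull that drep's per-epoch voting power from each instance for comparison (a sketch only; it assumes drep_hash.view stores the same bech32 encoding of the drep id as quoted above - if the encoding differs, match on drep_hash.raw instead):

-- Sketch: per-epoch voting power recorded for one drep.
-- Assumes drep_hash.view holds the bech32 id as written above.
SELECT dd.epoch_no, dd.amount
FROM drep_distr dd
JOIN drep_hash dh ON dh.id = dd.hash_id
WHERE dh.view = 'drep1ytuvz8hq2qmcdvy9rs4erpfhge5gypfut9yva9tjj2vw66g73ge5q'
  AND dd.epoch_no BETWEEN 536 AND 540
ORDER BY dd.epoch_no;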

The epoch_stake entries for the ~1m Ada stake account delegating to this drep appear to be the same on both instances, as do the epoch_stake amounts for all 11 stake accounts delegating to that drep for epochs 536 -> 540.
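To cross-check those epoch_stake amounts, something like the following can be run on both instances (a rough sketch: delegation_vote keeps the full certificate history, so a precise comparison would restrict to each address's latest vote delegation; table and column names are as in recent db-sync schemas):

-- Sketch: epoch_stake amounts for stake addresses that have a vote
-- delegation to the drep in question (ignores later re-delegations).
SELECT es.epoch_no, sa.view AS stake_address, es.amount
FROM epoch_stake es
JOIN stake_address sa ON sa.id = es.addr_id
WHERE es.epoch_no BETWEEN 536 AND 540
  AND es.addr_id IN (
    SELECT dv.addr_id
    FROM delegation_vote dv
    JOIN drep_hash dh ON dh.id = dv.drep_hash_id
    WHERE dh.view = 'drep1ytuvz8hq2qmcdvy9rs4erpfhge5gypfut9yva9tjj2vw66g73ge5q'
  )
ORDER BY es.epoch_no, sa.view;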

The output of cardano-cli conway query drep-stake-distribution on both machines returns the same value for that drep.

The difference between the per-epoch sums of voting power (grouping the drep_distr table by epoch_no) was roughly 1.09 million (Ada) at epoch 538 and up to 2 million by epoch 554.
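For the record, the per-epoch sums can be reproduced with a query of roughly this shape, run on each instance and diffed:

-- Sketch: total drep voting power per epoch as recorded in drep_distr.
SELECT epoch_no, SUM(amount) AS total_voting_power
FROM drep_distr
GROUP BY epoch_no
ORDER BY epoch_no;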

hodlonaut avatar May 12 '25 06:05 hodlonaut

The release notes for 13.6.0.4 have this:

Known Issue
For networks that were already in Protocol Version 10 (e.g. Preview and PreProd) before the upgrade 
to this release, the values of drep_distribution before the upgrade are not well defined and could be
slightly different between different executions. This shouldn't have any cascading effect. For these 
networks it is advised to delete all ledger state files and let the ledger rules replay from genesis. 
You can read more about this issue at https://github.com/IntersectMBO/cardano-ledger/issues/4772.

In general, 13.6.0.4 was the recommended release to upgrade to from older versions before the HF. Could this be relevant?

kderme avatar May 15 '25 00:05 kderme

This instance is on mainnet and was created on 7th January, already on 13.6.0.4, using a snapshot from another instance (the same base snapshot used for the majority of other instances). The mentioned hard fork occurred in February, so I believe it should not follow the pattern mentioned above(?). Additional notes (if relevant):

  • The upgrade from 13.6.0.4 to 13.6.0.5 occurred on March 20 (as an in-place upgrade), if relevant
  • All our instances use the config below:
"insert_options": {          
    "tx_out": {        
      "value": "consumed",
      "use_address_table": true
    },                                              
    "ledger": "enable",          
    "shelley": {
      "enable": true
    },      
    "multi_asset": {     
      "enable": true                                  
    },                                                                                                                 
    "metadata": {                
      "enable": true
    },       
    "plutus": {
      "enable": true                                
    },                              
    "governance": "enable",
    "json_type": "text",
    "offchain_pool_data": "enable",
    "pool_stat": "enable",
    "tx_cbor": "enable"
  },

rdlrt avatar May 15 '25 01:05 rdlrt

Is the data used to populate drep_distr (e.g. at the epoch boundary) fetched from other data already in db-sync, or from the ledger/node? If the latter, can we manually request the same information for the dreps in question from that source, to confirm whether inaccurate information is being returned?

gregbgithub avatar May 15 '25 02:05 gregbgithub

I wonder if the snapshots were generated with a previous version, so that continuing the sync against 13.6.0.5 then causes errors?

Cmdv avatar May 21 '25 19:05 Cmdv

@Cmdv - Wouldn't that occur across all instances the snapshot was restored on? In this case there are, in fact, two specific machines that go through upgrades at exactly the same time (they are supposed to be replicas in different geographies). Having said that, @hodlonaut - we should probably do a dump from all instances across networks on our end and compare, to see if it's just a one-off (if so, I'd not bother with it).
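As a sketch of that dump-and-compare, an ordered dump of drep_distr from each instance (exported e.g. with psql's \copy) should diff cleanly if the data agrees; joining on drep_hash.view assumes the bech32 encoding is consistent across instances, otherwise use drep_hash.raw:

-- Sketch: deterministic, ordered dump of drep_distr for diffing
-- between instances.
SELECT dd.epoch_no, dh.view AS drep, dd.amount
FROM drep_distr dd
JOIN drep_hash dh ON dh.id = dd.hash_id
ORDER BY dd.epoch_no, dh.view;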

rdlrt avatar May 22 '25 00:05 rdlrt

I'm going to take a good look at this when my head is out of the water with the PR I've been working on for the past few months!!!

Cmdv avatar May 23 '25 20:05 Cmdv

I'm going to take a good look at this when my head is out of the water with the PR I've been working on for the past few months!!!

Thanks. Please also, if you have a chance, have a look at the question I posed earlier, as it'd be helpful to pinpoint the exact component that's serving up the original incorrect data.

gregbgithub avatar Jun 12 '25 00:06 gregbgithub