cardano-db-sync
Db-sync Hang Causing Midnight Node Outages
We have observed multiple incidents where db-sync hangs, causing Midnight node outages due to a failure in importing blocks. This issue has occurred intermittently and has been temporarily resolved with pod restarts, but a more robust solution is needed to prevent manual intervention.
Problem: db-sync hanging results in the node failing to import blocks, with the following error:
sync: :broken_heart: Verification failed for block 0xd0c306e29f09841635ae13ade6f9dce33e6ed2b3b565eac826154e23a89d475a received from (12D3KooWQF1x9ffPo73DRK8XKPw1Ev9BnJhNQc6QBke1tLnssumX): "Main chain state d23e68ee90dcc4677b2f67152daf8e08ebb3cf9507b9a587120882c280ed0c05 referenced in imported block at slot 290165258 with timestamp 1740991548000 not found"
db-sync was 3 hours behind the Cardano tip when the issue was discovered. The Cardano node is synced and importing blocks, indicating that the issue is isolated to db-sync. One db-sync pod entered this stuck state, requiring a manual restart to recover. Another db-sync pod self-recovered without intervention, though logs suggest it may have undergone an automatic pod refresh. We need a root cause analysis to determine why db-sync enters this state and a fix that eliminates the need for manual restarts.
Logs & Observations: The last log message from db-sync showed a successful tip import before pausing indefinitely. If db-sync stops receiving blocks, it does not necessarily throw an error, making detection and recovery more difficult.
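One way to make the hang detectable would be a staleness check against db-sync's own database, since the timestamp of the last imported block is recorded in the block table. A minimal sketch, assuming the standard cexplorer schema and psql connectivity; the threshold, connection variables, and password handling are placeholders to adapt:

```bash
#!/usr/bin/env bash
# Sketch of a staleness probe: exit non-zero if db-sync's most recently
# imported block is older than a threshold, so a supervisor (e.g. a
# Kubernetes liveness probe) can restart the pod instead of waiting for
# downstream Midnight node errors. Assumes PGPASSWORD or ~/.pgpass is set.
set -euo pipefail

THRESHOLD_SECONDS="${THRESHOLD_SECONDS:-600}"  # placeholder; tune per network

# block.time holds the on-chain timestamp of the last block db-sync inserted.
# Note: during an initial sync the tip legitimately lags, so only enable
# this check once the instance is fully synced.
LAG=$(psql "host=${POSTGRES_HOST} port=${POSTGRES_PORT} dbname=${POSTGRES_DB} user=${POSTGRES_USER}" \
  --tuples-only --no-align \
  --command "SELECT COALESCE(EXTRACT(EPOCH FROM (now() - max(time)))::int, 2147483647) FROM block;")

if [ "${LAG}" -gt "${THRESHOLD_SECONDS}" ]; then
  echo "db-sync tip is ${LAG}s behind wall-clock time (threshold ${THRESHOLD_SECONDS}s)" >&2
  exit 1
fi
echo "db-sync tip is ${LAG}s behind wall-clock time"
```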
Not from the team, but just commenting on basics:
- Version details for the infrastructure you're running are missing, as are the dbsync config options (which have a high impact on system requirements)
- It's unclear what you're trying to show with the screenshot; it doesn't (at least on a quick look) highlight any issues
- What is that log message pasted from? It does not look like a typical dbsync-formatted message. What do the dbsync logs show prior to / after the mentioned hang?
- What are the infra specs where the mentioned container/pod is running?
- If the pod was refreshed/restarted, it might especially point to infrastructure sizing
Another db-sync pod self-recovered without intervention, though logs suggest it may have undergone an automatic pod refresh.
What do you mean it "refreshed"? Did Kubernetes restart it? If so, can you find out why it did? Also, are you saving off all the logs?
EDIT: Also worth noting, this is a preview node
- What is that log message pasted from? It does not look like a typical dbsync-formatted message. What do the dbsync logs show prior to / after the mentioned hang?
This looks like a Midnight node message
I'm also on the Midnight node team - I can answer the questions here:
Version details for the infrastructure you're running are missing, as are the dbsync config options (which have a high impact on system requirements)
This pod was running the ghcr.io/intersectmbo/cardano-db-sync:13.5.0.2 image. I checked the changelog to see whether a fix was included in more recent versions; I didn't immediately spot anything that might cause the hang, but I'll upgrade our images to the latest (13.6.0.4).
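For reference, the image bump itself could look something like the following, assuming the StatefulSet and container names from the pod details later in the thread; adapt if the image is managed through Helm or GitOps:

```bash
# Bump the db-sync image on the StatefulSet (container name "db-sync" is assumed).
kubectl -n testnet-02 set image statefulset/db-sync-cardano-10 \
  db-sync=ghcr.io/intersectmbo/cardano-db-sync:13.6.0.4

# Watch the pods roll with the new image.
kubectl -n testnet-02 rollout status statefulset/db-sync-cardano-10
```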
In terms of config, we have the following environment variables set:
Environment:
  NETWORK:                   preview
  CARDANO_NODE_SOCKET_PATH:  /node-ipc/node.socket
  POSTGRES_DB:               cexplorer
  POSTGRES_USER:             cardano
  POSTGRES_PORT:             5432
  POSTGRES_HOST:             psql-dbsync-cardano-01-primary
  POSTGRES_DB:               cexplorer
  POSTGRES_USER:             cardano
  POSTGRES_PASSWORD:         <set to the key 'password' in secret 'psql-dbsync-cardano-01-pguser-cardano'>  Optional: false
  POSTGRES_PORT:             5432
It's unclear what you're trying to show with the screenshot; it doesn't (at least on a quick look) highlight any issues
The screenshot is only really important for the timestamps - it was taken at ~12:00, so it shows that db-sync had not progressed since then
What is that log message pasted from? It does not look like a typical dbsync-formatted message. What do the dbsync logs show prior to / after the mentioned hang?
As pointed out by @sgillespie, the log is from the midnight-node. We're running as a partner chain; that log message is from their code. It shows that it can't find the referenced state in the Cardano preview network ("main chain") - the source of that data is db-sync.
What are the infra specs where the mentioned container/pod is running?
I'll double-check this and comment back here
If the pod was refreshed/restarted, it might especially point to infrastructure sizing
The pods are refreshed at regular intervals - I'll have to double-check the reasoning behind it; partly it's a chaos engineering strategy.
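To help answer the earlier question about whether Kubernetes restarted the pod and why, something like the following would show the last termination reason and the related events (the container name "db-sync" is an assumption):

```bash
# Last termination state of the db-sync container (e.g. OOMKilled, Error, Completed).
kubectl -n testnet-02 get pod db-sync-cardano-10-0 \
  -o jsonpath='{.status.containerStatuses[?(@.name=="db-sync")].lastState.terminated}'

# Recent events for the pod (evictions, failed probes, scheduled restarts).
kubectl -n testnet-02 get events \
  --field-selector involvedObject.name=db-sync-cardano-10-0 \
  --sort-by=.lastTimestamp
```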
What are the infra specs where the mentioned container/pod is running?
db-sync-cardano-10-0 Service Deployment Information

Pod Details
- Pod Name: db-sync-cardano-10-0
- Namespace: testnet-02
- Controller: StatefulSet/db-sync-cardano-10
- Status: Running
- Pod IP: 10.14.90.5
- Node: ip-10-14-90-40.eu-west-1.compute.internal

Image Information
- db-sync Container Image: ghcr.io/intersectmbo/cardano-db-sync:13.5.0.2

Node
- CPU: 8 cores
- Memory: 32.3 GiB total (31.3 GiB allocatable)
- Ephemeral Storage: ~40 GiB total (~37.5 GiB allocatable)

Volume
- Available Volume for Syncing Data: 80 GiB
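To compare actual usage against these specs (relevant to the sizing question above), something like the following could be run; kubectl top requires metrics-server, and the volume mount path is an assumption to verify against the pod spec:

```bash
# Live CPU/memory usage per container (requires metrics-server).
kubectl -n testnet-02 top pod db-sync-cardano-10-0 --containers

# Disk usage on the sync-data volume (mount path /var/lib/cexplorer is an assumption).
kubectl -n testnet-02 exec db-sync-cardano-10-0 -c db-sync -- df -h /var/lib/cexplorer
```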