graph-node
[Bug] Rewinding a subgraph causes a constraint violation in graph-node that in turn causes indexer-agent to crashloop
Bug report
graph-node:v0.34.1
indexer-agent:v0.20.22
Activities that were undertaken before observing this bug:
- Cleared call_cache for Arbitrum as part of a complex subgraph sync performance troubleshooting exercise via psql
- Rewound a specific problematic subgraph, Silo Finance Arbitrum, QmTMKqty5yZvZtB3SwzXUG92aZUH1YQw3VjByGw4wgaMhW, to block 1 using graphman
- Observed the above subgraph syncing to ~130m blocks, then stalled.
- Checked graph-node logs and found related error (see log output)
- Observed indexer-agent complaining about same issue and crash looping - cannot use the agent at all right now to manage subgraphs (see log output)
IMPACT: Production Indexer at risk; we cannot manage our online and offline allocations while we have this issue - ideally need a temp fix for the specific symptoms. Would graphman drop resolve the issue? Would the graph-node and indexer-agent be able to handle that and start syncing the sub again from scratch given this is a subgraph in flight with live allocations?
Relevant log output
----- GRAPH-NODE
Apr 04 13:24:34.037 ERRO Subgraph instance failed to run: internal constraint violated: Subgraph writer for QmTMKqty5yZvZtB3SwzXUG92aZUH1YQw3VjByGw4wgaMhW[sgd622] is not running, sgd: 622, subgraph_id: QmTMKqty5yZvZtB3SwzXUG92aZUH1YQw3VjByGw4wgaMhW, component: SubgraphInstanceManager
Apr 04 13:48:18.741 WARN Price provider Removed: 0x8dca64a43865454f41aa1a3cf0140eb89f2c08aa53871235ecbe46b6a309a1e3, data_source: PriceProvidersRepository, sgd: 622, subgraph_id: QmTMKqty5yZvZtB3SwzXUG92aZUH1YQw3VjByGw4wgaMhW, component: SubgraphInstanceManager > UserMapping
Apr 04 13:48:18.742 ERRO Oracle was not found when trying to remove it at txn: 0x8dca64a43865454f41aa1a3cf0140eb89f2c08aa53871235ecbe46b6a309a1e3, data_source: PriceProvidersRepository, sgd: 622, subgraph_id: QmTMKqty5yZvZtB3SwzXUG92aZUH1YQw3VjByGw4wgaMhW, component: SubgraphInstanceManager > UserMapping
----- INDEXER-AGENT
{"level":50,"time":1712241706593,"pid":1,"hostname":"268ad9e1400b","name":"IndexerAgent","component":"GraphNode","err":{"type":"IndexerError","message":"Failed to query indexing status API","stack":"IndexerError: Failed to query indexing status API\n at indexerError (/opt/indexer/packages/indexer-common/dist/errors.js:173:12)\n at GraphNode.<anonymous> (/opt/indexer/packages/indexer-common/dist/graph-node.js:146:55)\n at Generator.next (<anonymous>)\n at fulfilled (/opt/indexer/packages/indexer-common/dist/graph-node.js:5:58)\n at processTicksAndRejections (node:internal/process/task_queues:96:5)","code":"IE018","explanation":"https://github.com/graphprotocol/indexer/blob/main/docs/errors.md#ie018","cause":{"type":"CombinedError","message":"[GraphQL] Store error: internal constraint violated: the entityCount for QmTMKqty5yZvZtB3SwzXUG92aZUH1YQw3VjByGw4wgaMhW is not representable as a u64","name":"CombinedError","graphQLErrors":[{"message":"Store error: internal constraint violated: the entityCount for QmTMKqty5yZvZtB3SwzXUG92aZUH1YQw3VjByGw4wgaMhW is not representable as a u64"}],"response":{"size":0,"timeout":0}}},"msg":"Failed to query indexing status API"}
IPFS hash
No response
Subgraph name or link to explorer
https://thegraph.com/explorer/subgraphs/2ufoztRpybsgogPVW6j9NTn1JmBWFYPKbP7pAabizADU?view=Overview&chain=arbitrum-one
Some information to help us out
- [ ] Tick this box if this bug is caused by a regression found in the latest release.
- [ ] Tick this box if this bug is specific to the hosted service.
- [X] I have searched the issue tracker to make sure this issue is not a duplicate.
OS information
Linux
the entityCount for QmTMKqty5yZvZtB3SwzXUG92aZUH1YQw3VjByGw4wgaMhW is not representable as a u64
Maybe the rewind somehow turned the entity count negative. Which is a bug of course.
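To illustrate that hypothesis: if the count is tracked as a signed value and a rewind subtracts more entities than were ever recorded, converting it to `u64` fails. The sketch below is not graph-node's code; `entity_count_as_u64` and the `i64` storage type are assumptions made purely for illustration, and only the error wording is taken from the log above.

```rust
use std::convert::TryFrom;

/// Hypothetical helper (not graph-node's actual code): convert a signed
/// entity count, assumed here to be stored as an i64, into a u64.
/// A negative count cannot be represented and yields an error that
/// mirrors the wording of the message in the log output.
fn entity_count_as_u64(deployment: &str, count: i64) -> Result<u64, String> {
    u64::try_from(count).map_err(|_| {
        format!(
            "the entityCount for {} is not representable as a u64",
            deployment
        )
    })
}

fn main() {
    // A healthy, non-negative count converts without trouble.
    println!("{:?}", entity_count_as_u64("QmExample", 42));
    // A count that a rewind has pushed below zero is rejected.
    println!("{:?}", entity_count_as_u64("QmExample", -7));
}
```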
@leoyvens I think the problem came from that rewind to block 1 when the start block was actually 51880000. That means graph-node doesn't handle that scenario, and it created all that chaos.
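One possible guard for that scenario, sketched under the assumption that the tool knows the deployment's start block before rewinding: reject any target below it. `validate_rewind_target` is a hypothetical name, not an existing graphman or graph-node function.

```rust
/// Hypothetical guard (not part of graphman or graph-node): refuse to
/// rewind a deployment to a block below the start block of its manifest.
fn validate_rewind_target(start_block: u64, target_block: u64) -> Result<u64, String> {
    if target_block < start_block {
        Err(format!(
            "rewind target {} is below the subgraph start block {}",
            target_block, start_block
        ))
    } else {
        Ok(target_block)
    }
}

fn main() {
    // The case from this issue: start block 51880000, rewind to block 1.
    println!("{:?}", validate_rewind_target(51_880_000, 1));
    // A rewind to a block at or after the start block passes the check.
    println!("{:?}", validate_rewind_target(51_880_000, 60_000_000));
}
```

Whether the right behaviour is to reject such a rewind or to clamp it to the start block is a design choice this sketch leaves open.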
Looks like this issue has been open for 6 months with no activity. Is it still relevant? If not, please remember to close it.