Problem: tendermint tx indexer could miss a block on restart

Open jun0tpyrc opened this issue 3 years ago • 7 comments

This is an older block at the time of query (119510)

[root@bprod-cronos-1 ~]# curl localhost:8545 -X POST -H "Content-Type: application/json" -d '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params": ["0x1D2D6", true],"id":1}' -s | jq

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "difficulty": "0x0",
    "extraData": "0x",
    "gasLimit": "0xffffffff",
    "gasUsed": "0x24d2b8",
    "hash": "0x4324b6e0116d22f6ce615f227276f8f974659c237f71acbd73d02981e225a05c",
    "logsBloom": "0x002000000200000400080000800200020000000c340800001020000080200310c00050008000010000200000000004020840000040e00000200000040021400200201400008200801040000c00001020000002000241000082010040800020000810040002200000000001001080080000020000000816002400401000000000812002000000c004000000000102091000000801d0006008100000408000000006000802101200010020044000002000800602000108000012220020000000800000000200028000880001000102200000020008c010101400002242020020000010110000801000000000000001000000004000400000400800004000002000",
    "miner": "0x81e3e543647e466a5abc824f5844ab0a091b6c6c",
    "mixHash": "0x0000000000000000000000000000000000000000000000000000000000000000",
    "nonce": "0x0000000000000000",
    "number": "0x1d2d6",
    "parentHash": "0xad2432f13c825d3a9163c9e6d518f08633b00ad6a1eb87c747ad7f2427dabe90",
    "receiptsRoot": "0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421",
    "sha3Uncles": "0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347",
    "size": "0x909c",
    "stateRoot": "0xea8aae6836bdc465215558bc1c7511717069c914eaae48e984af1415b6564a42",
    "timestamp": "0x6192aef2",
    "totalDifficulty": "0x0",
    "transactions": [],
    "transactionsRoot": "0x8b2974a39d3aeed1cbe3aa1654ab933a9bcb4d7215b411b8a6cf569c1783abde",
    "uncles": []
  }
}
^ Compared to the expected response, the transactions array is now empty.

The node has newer blocks and keeps ingesting new blocks, though:

{
  "blockNumber": 125710,
  "blockHash": "0x887475723ba801bb1b2ce68709b74941223652f09317127aaf15a4d49bfeb85e",
  "blockTime": "2021-11-16T04:47:34.000Z",
  "checkTime": "2021-11-16T04:47:50.836Z",
  "timeDiff": 16836
}

To Reproduce: It occurs intermittently, and only on some of the nodes (x86_64).

jun0tpyrc avatar Nov 16 '21 05:11 jun0tpyrc

Noticed this happened a few times yesterday as well; we had to manually resync the block on the subgraph whenever it happened.

crypto-steve-ng avatar Nov 18 '21 03:11 crypto-steve-ng

Could be an error in Tendermint. @JayT106, is it possible that either e.clientCtx.Client.Block(e.ctx, &height) or e.clientCtx.Client.TxSearch returns different values on different nodes?

thomas-nguy avatar Nov 18 '21 03:11 thomas-nguy

First, the block data should be the same across all nodes in the network; otherwise consensus would break. We might debug EthBlockFromTendermint to see why the block data transformation fails.

Second, TxSearch looks up transactions in the indexer, so the node must have the KV indexer enabled. The indexer might fail to index the transaction and block data if something goes wrong during indexing (when a new block is committed). So it's possible to get different results from different nodes.

JayT106 avatar Nov 18 '21 15:11 JayT106

The tendermint tx indexer service runs asynchronously with block commit, so it can lag behind. And since the eth_getBlockByNumber RPC API relies on /tx_search to find the txs, it may fail to find them, especially for recent blocks.
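
As a rough illustration of that mismatch (a sketch, not code from this thread), one could compare the number of txs in the raw block with the number the indexer reports for the same height. The Go snippet below assumes the Tendermint v0.34 JSON-RPC response shapes for `/block?height=N` (`result.block.data.txs`) and `/tx_search?query="tx.height=N"` (`result.total_count`, encoded as a string). The helper names `blockTxCount`, `indexedTxCount` and `indexerMissedBlock` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// blockResponse mirrors only the fields needed from a /block response
// (assumed shape: result.block.data.txs is a list of base64-encoded txs).
type blockResponse struct {
	Result struct {
		Block struct {
			Data struct {
				Txs []string `json:"txs"`
			} `json:"data"`
		} `json:"block"`
	} `json:"result"`
}

// txSearchResponse mirrors only total_count from a /tx_search response
// (assumed to be a decimal number encoded as a JSON string).
type txSearchResponse struct {
	Result struct {
		TotalCount string `json:"total_count"`
	} `json:"result"`
}

// blockTxCount returns how many txs the committed block contains.
func blockTxCount(raw []byte) (int, error) {
	var r blockResponse
	if err := json.Unmarshal(raw, &r); err != nil {
		return 0, fmt.Errorf("decode /block response: %w", err)
	}
	return len(r.Result.Block.Data.Txs), nil
}

// indexedTxCount returns how many txs the indexer reports for the height.
func indexedTxCount(raw []byte) (int, error) {
	var r txSearchResponse
	if err := json.Unmarshal(raw, &r); err != nil {
		return 0, fmt.Errorf("decode /tx_search response: %w", err)
	}
	n, err := strconv.Atoi(r.Result.TotalCount)
	if err != nil {
		return 0, fmt.Errorf("parse total_count %q: %w", r.Result.TotalCount, err)
	}
	return n, nil
}

// indexerMissedBlock reports whether the indexer knows fewer txs than the
// block actually contains, i.e. the height is (at least partly) unindexed.
func indexerMissedBlock(blockRaw, searchRaw []byte) (bool, error) {
	inBlock, err := blockTxCount(blockRaw)
	if err != nil {
		return false, err
	}
	indexed, err := indexedTxCount(searchRaw)
	if err != nil {
		return false, err
	}
	return indexed < inBlock, nil
}
```

A mismatch on a very recent height may only mean the indexer is still catching up; a mismatch that persists on an old height would be consistent with the missed-block behaviour described in this issue.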

yihuang avatar Nov 24 '21 02:11 yihuang

It reproduced on one of our nodes: the tendermint tx indexer failed to index a whole block and did not recover automatically, matching what the OP described.

yihuang avatar Nov 24 '21 04:11 yihuang

https://github.com/tendermint/tendermint/issues/7312 We found that when a node restarts, there's a chance that the tx indexer misses a block.

yihuang avatar Nov 24 '21 06:11 yihuang

I tested, and it has been fixed in this PR. Not sure whether it will be backported to v0.34: https://github.com/tendermint/tendermint/issues/7312#issuecomment-978056789

JayT106 avatar Nov 24 '21 16:11 JayT106