
abicodec: internal server error while validating cursor

Open fschoell opened this issue 4 years ago • 10 comments

We're getting these errors in our abicodec instances:

Aug 19 10:27:49 jungle-abic11 dfuseeos[28409]: 2020-08-19T10:27:49.513Z (abicodec) starting ABI syncer (abicodec/syncer.go:81)
Aug 19 10:27:54 jungle-abic11 dfuseeos[28409]: 2020-08-19T10:27:54.519Z (abicodec) abi codec stream abi changes (abicodec/syncer.go:83){"error": "search stream terminated with error: rpc error: code = InvalidArgument desc = internal server error while validating cursor"}
Aug 19 10:27:54 jungle-abic11 dfuseeos[28409]: 2020-08-19T10:27:54.519Z (abicodec) the search stream ended with error (abicodec/syncer.go:86){"error": "search stream terminated with error: rpc error: code = InvalidArgument desc = internal server error while validating cursor"}
Aug 19 10:27:55 jungle-abic11 dfuseeos[28409]: 2020-08-19T10:27:55.519Z (abicodec) starting ABI syncer (abicodec/syncer.go:81)
Aug 19 10:28:00 jungle-abic11 dfuseeos[28409]: 2020-08-19T10:28:00.524Z (abicodec) abi codec stream abi changes (abicodec/syncer.go:83){"error": "search stream terminated with error: rpc error: code = InvalidArgument desc = internal server error while validating cursor"}
Aug 19 10:28:00 jungle-abic11 dfuseeos[28409]: 2020-08-19T10:28:00.525Z (abicodec) the search stream ended with error (abicodec/syncer.go:86){"error": "search stream terminated with error: rpc error: code = InvalidArgument desc = internal server error while validating cursor"}
Aug 19 10:28:01 jungle-abic11 dfuseeos[28409]: 2020-08-19T10:28:01.525Z (abicodec) starting ABI syncer (abicodec/syncer.go:81)
Aug 19 10:28:06 jungle-abic11 dfuseeos[28409]: 2020-08-19T10:28:06.530Z (abicodec) abi codec stream abi changes (abicodec/syncer.go:83){"error": "search stream terminated with error: rpc error: code = InvalidArgument desc = internal server error while validating cursor"}
Aug 19 10:28:06 jungle-abic11 dfuseeos[28409]: 2020-08-19T10:28:06.530Z (abicodec) the search stream ended with error (abicodec/syncer.go:86){"error": "search stream terminated with error: rpc error: code = InvalidArgument desc = internal server error while validating cursor"}

fschoell avatar Aug 19 '20 10:08 fschoell

The problem is actually that the search router cannot reach blockmeta... we need the search router log line that says "connecting to blockmeta to get irrBlk"; it contains the actual error...
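
In case it helps, something like this should surface that line on the search-router host (a sketch only; the systemd unit name is an assumption based on the journal lines above):

# Run on the search-router host; adjust the unit name to your deployment.
journalctl -u dfuseeos --since "1 hour ago" | grep -i "connecting to blockmeta"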

sduchesneau avatar Aug 19 '20 19:08 sduchesneau

The cursor contains a block ID, so the search router needs to know whether that block was forked out or not, so it can send "undo/redo" responses. Since blockmeta cannot tell it what the block ID is at that number, the search router refuses to continue from this cursor, because it could send wrong search responses. Possible causes why blockmeta could not give the answer (quick checks for both are sketched after the list):

  1. trxdb loader is lagging behind
  2. blockmeta is not receiving blocks anymore (network issue connecting to relayer...)
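
Two quick checks for those causes (a sketch; the metrics port and metric name are assumptions, adjust to your deployment):

# 1) Is trxdb-loader lagging? Look at its drift metric on its metrics port.
curl -s localhost:9102 | grep head_block_time_drift

# 2) Is blockmeta still receiving blocks? Ask it for its head info.
grpcurl -plaintext -d '{"source":0}' localhost:9000 dfuse.headinfo.v1.HeadInfo/GetHeadInfo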

sduchesneau avatar Aug 19 '20 20:08 sduchesneau

Okay, so trxdb-loader is up to date and running live.

We have 3 blockmeta instances running (on 3 different machines). On each machine there's a relayer running, and blockmeta connects to the relayer on the same machine. Each relayer is in turn connected to all 3 running mindreader instances.

All the blockmeta services are stuck at different block numbers:

Blockmeta 1:

Aug 20 10:08:16 jungle-bm11 dfuseeos[68839]: 2020-08-20T10:08:16.162Z (blockmeta) behind lib (blockmeta/server.go:319){"block_num": 30352948}
Aug 20 10:08:17 jungle-bm11 dfuseeos[68839]: 2020-08-20T10:08:17.163Z (blockmeta) num to id called (blockmeta/server.go:313){"block_num": 30352948}

Blockmeta 2:

Aug 20 10:08:38 jungle-bm12 dfuseeos[38793]: 2020-08-20T10:08:38.731Z (blockmeta) behind lib (blockmeta/server.go:319){"block_num": 30470203}
Aug 20 10:08:39 jungle-bm12 dfuseeos[38793]: 2020-08-20T10:08:39.733Z (blockmeta) num to id called (blockmeta/server.go:313){"block_num": 30470203}

Blockmeta 3:

Aug 20 10:08:59 jungle-bm13 dfuseeos[92428]: 2020-08-20T10:08:59.223Z (blockmeta) behind lib (blockmeta/server.go:319){"block_num": 29320500}
Aug 20 10:09:00 jungle-bm13 dfuseeos[92428]: 2020-08-20T10:09:00.224Z (blockmeta) num to id called (blockmeta/server.go:313){"block_num": 29320500}

The relayers are looking fine though:

Aug 20 04:17:37 jungle-drly12 dfuseeos[40386]: 2020-08-20T04:17:37.041Z (relayer) subscribed (blockstream/server.go:215){"new_length": 4, "subscriber": ""}
Aug 20 04:17:37 jungle-drly12 dfuseeos[40386]: 2020-08-20T04:17:37.041Z (relayer.sub.search-live) sending burst (blockstream/server.go:205){"busrt_size": 300}
Aug 20 04:17:37 jungle-drly12 dfuseeos[40386]: 2020-08-20T04:17:37.041Z (relayer) subscribed (blockstream/server.go:215){"new_length": 5, "subscriber": "search-live"}
Aug 20 04:17:37 jungle-drly12 dfuseeos[40386]: 2020-08-20T04:17:37.498Z (relayer.js.file) reading from blocks store: file does not (yet?) exist, retrying in (bstream/filesource.go:139){"filename": "0031488000.dbin.zst", "base_filename": "0031488000", "retry_delay": "4s"}
Aug 20 04:17:37 jungle-drly12 dfuseeos[40386]: 2020-08-20T04:17:37.498Z (relayer.js.file) file not found callback set, calling it (bstream/filesource.go:143){"base_block_num": 31488000}
Aug 20 04:17:37 jungle-drly12 dfuseeos[40386]: 2020-08-20T04:17:37.498Z (relayer.js.merger) creating merger source due to filesource file not found callback (bstream/joiningsource.go:305){"file_not_found_base_block_num": 31488000}
Aug 20 04:17:38 jungle-drly12 dfuseeos[40386]: 2020-08-20T04:17:38.358Z (relayer.js) shutting file source, switching to live (from a merger block matching) (bstream/joiningsource.go:394)
Aug 20 04:17:41 jungle-drly12 dfuseeos[40386]: 2020-08-20T04:17:41.637Z (relayer.js) joining state LIVE (bstream/joiningsource.go:506){"last_live_block": 31488083}
Aug 20 04:17:42 jungle-drly12 dfuseeos[40386]: 2020-08-20T04:17:42.358Z (relayer.js.file) blocks archive streaming was asked to stop (bstream/filesource.go:125)
Aug 20 04:17:56 jungle-drly12 dfuseeos[40386]: 2020-08-20T04:17:56.636Z (relayer) now monitoring drift (relayer/relayer.go:305){"max_drift": "5m0s"}

The same goes for the mindreaders; they are all up to date as well.

We used snapshots to set the mindreaders back because of holes in our one-block files. Could that be an issue for other dfuse components?
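
If it helps, here is a rough way we could look for gaps in a local one-block files directory (a sketch only; the path is hypothetical and it assumes file names start with a zero-padded block number, like the merged files in the relayer logs above):

# List one-block files, keep the 10-digit number prefix, and report any
# non-consecutive jump between block numbers.
ls /data/oneblocks | cut -c1-10 | sort -n -u | awk 'NR>1 && $1 != prev+1 {printf "gap: %d -> %d\n", prev, $1} {prev=$1}'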

fschoell avatar Aug 20 '20 10:08 fschoell

How do you know if the relayer is up to date?

  1. run this on it: curl localhost:9102 | grep 'relayer' (for the drift in seconds...)
  2. run this on it: grpcurl -plaintext -d '{"source":0}' localhost:9000 dfuse.headinfo.v1.HeadInfo/GetHeadInfo

How do you know if the blockmeta is lagging? Run this on it: grpcurl -plaintext -d '{"source":0}' localhost:9000 dfuse.headinfo.v1.HeadInfo/GetHeadInfo

Can you confirm the head block and drift of each, and give me the output of the 3 commands?

Also: any error logs in the blockmeta?

sduchesneau avatar Aug 20 '20 16:08 sduchesneau

Relayer seems fine:

head_block_time_drift{app="relayer"} -0.255444047

$ grpcurl -plaintext -d '{"source":0}' localhost:9000 dfuse.headinfo.v1.HeadInfo/GetHeadInfo
{
  "libNum": "32589425",
  "headNum": "32589755",
  "headID": "01f147bb72fa6bb4977a00062bab7b911394e733771a3c40b6d0d467f17f1f7b",
  "headTime": "2020-08-26T13:23:32.500Z"
}

Blockmeta seems to be up-to-date as well:

$ grpcurl -plaintext -d '{"source":0}' localhost:9000 dfuse.headinfo.v1.HeadInfo/GetHeadInfo
{
  "libNum": "32589641",
  "libID": "01f147496c24a2a6ec13f4aa72890d44fed4f6be79e6fbdcae373bc608d1d672",
  "headNum": "32589975",
  "headID": "01f14897bb474cf21dd2711ac8020f2002b96da681725ff9a14a98ee00b3d4ed",
  "headTime": "2020-08-26T13:25:22.500Z"
}
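
For what it's worth, the drift can also be eyeballed by comparing headTime against the wall clock (assumes GNU date and jq are available):

# Seconds between "now" and blockmeta's reported head time.
HEAD_TIME=$(grpcurl -plaintext -d '{"source":0}' localhost:9000 dfuse.headinfo.v1.HeadInfo/GetHeadInfo | jq -r .headTime)
echo "drift: $(( $(date -u +%s) - $(date -u -d "$HEAD_TIME" +%s) ))s"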

fschoell avatar Aug 26 '20 13:08 fschoell

Ok, this means the blockmeta instances are NOT stuck at this block height; they are OK. Maybe the abicodec has a cursor that points to a block that was forked out (hence the call to blockmeta num_to_id), and the fork-resolving mechanism is broken for some reason?

Can you start your abicodec in debug mode and look for this line: streaming abi changes (it should contain the cursor; the cursor is basically a block ID + block number + transaction ID prefix, encoded in protobuf)

Getting this cursor should confirm A) whether its block number matches one of the num_to_id calls on blockmeta, and B) whether it is a forked block ID.
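
For B), one way to check is to ask a Jungle API node for the canonical block ID at that height and compare it with the ID in the cursor (a sketch; the endpoint below is only a placeholder, use any node you trust):

# Canonical ID for block 30352948 according to the chain API.
curl -s https://jungle.example.com/v1/chain/get_block -d '{"block_num_or_id": 30352948}' | jq -r .id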

sduchesneau avatar Aug 26 '20 13:08 sduchesneau

That's the debug output from abicodec:

Aug 26 13:45:11 jungle-abic11 dfuseeos[16142]: 2020-08-26T13:45:11.967Z (abicodec) starting ABI syncer (abicodec/syncer.go:81)
Aug 26 13:45:11 jungle-abic11 dfuseeos[16142]: 2020-08-26T13:45:11.968Z (abicodec) streaming abi changes (abicodec/syncer.go:100){"cursor": "1:30352948:01cf2634aee0cc387c1631d68b66f56de0775d3745ee45ce983086d1a6f2c3b9:"}
Aug 26 13:45:11 jungle-abic11 dfuseeos[16142]: 2020-08-26T13:45:11.968Z (common_client) starting dhammer loop ([email protected]/common_client.go:85)
Aug 26 13:45:11 jungle-abic11 dfuseeos[16142]: 2020-08-26T13:45:11.968Z (common_client) search stream performing actual search request ([email protected]/common_client.go:32)
Aug 26 13:45:11 jungle-abic11 dfuseeos[16142]: 2020-08-26T13:45:11.968Z (common_client) search stream starting receive loop ([email protected]/common_client.go:39)
Aug 26 13:45:16 jungle-abic11 dfuseeos[16142]: 2020-08-26T13:45:16.974Z (common_client) search stream loop completed ([email protected]/common_client.go:27)
Aug 26 13:45:16 jungle-abic11 dfuseeos[16142]: 2020-08-26T13:45:16.974Z (dhammer) input reader channel closed ([email protected]/dhammer.go:118)
Aug 26 13:45:16 jungle-abic11 dfuseeos[16142]: 2020-08-26T13:45:16.974Z (eos_client) processing hammer batch (search-client/eos_client.go:62){"item_count": 1}
Aug 26 13:45:16 jungle-abic11 dfuseeos[16142]: 2020-08-26T13:45:16.974Z (dhammer) input reader no more inflight and channel closed, closing decoupler ([email protected]/dhammer.go:138)
Aug 26 13:45:16 jungle-abic11 dfuseeos[16142]: 2020-08-26T13:45:16.974Z (dhammer) single batch shutter terminating ([email protected]/dhammer.go:226)
Aug 26 13:45:16 jungle-abic11 dfuseeos[16142]: 2020-08-26T13:45:16.974Z (dhammer) linearizer output single batch error, shutting down ([email protected]/dhammer.go:210){"error": "EOF"}
Aug 26 13:45:16 jungle-abic11 dfuseeos[16142]: 2020-08-26T13:45:16.974Z (dhammer) linearizer terminated, closing out channel ([email protected]/dhammer.go:190)
Aug 26 13:45:16 jungle-abic11 dfuseeos[16142]: 2020-08-26T13:45:16.974Z (common_client) dhammer channel closed ([email protected]/common_client.go:97)
Aug 26 13:45:16 jungle-abic11 dfuseeos[16142]: 2020-08-26T13:45:16.974Z (common_client) sending error from dhammer to consumer ([email protected]/common_client.go:99)
Aug 26 13:45:16 jungle-abic11 dfuseeos[16142]: 2020-08-26T13:45:16.974Z (common_client) dhammer loop completed ([email protected]/common_client.go:87)
Aug 26 13:45:16 jungle-abic11 dfuseeos[16142]: 2020-08-26T13:45:16.974Z (abicodec) abi codec stream abi changes (abicodec/syncer.go:83){"error": "search stream terminated with error: rpc error: code = InvalidArgument desc = internal server error while validating cursor"}
Aug 26 13:45:16 jungle-abic11 dfuseeos[16142]: 2020-08-26T13:45:16.974Z (abicodec) the search stream ended with error (abicodec/syncer.go:86){"error": "search stream terminated with error: rpc error: code = InvalidArgument desc = internal server error while validating cursor"}

fschoell avatar Aug 26 '20 13:08 fschoell

Is this the block ID: 01cf2634aee0cc387c1631d68b66f56de0775d3745ee45ce983086d1a6f2c3b9?

Then we have a fork; it's different from the ID of block 30352948 on bloks.io: https://jungle.bloks.io/block/30352948

fschoell avatar Aug 26 '20 13:08 fschoell

It also explains that log line on blockmeta, which was mistaken for an indication that blockmeta itself was lagging:

Aug 20 10:08:17 jungle-bm11 dfuseeos[68839]: 2020-08-20T10:08:17.163Z (blockmeta) num to id called (blockmeta/server.go:313){"block_num": 30352948}

You will need an instance of "forkresolver" (a small search server that does nothing except register itself into dmesh so the search-router knows it exists; the router calls it when it needs to start from a forked block, and the forkresolver then downloads block files and figures out how to get back to the longest chain).
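
If your deployment launches apps by listing them in dfuse.yaml's start args (or on the dfuseeos start command line), forkresolver would just be added there like any other app; a hypothetical example:

# Run the forkresolver app on a node that can reach dmesh and your block files store.
dfuseeos start forkresolver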

I'm digging in the code to see why you didn't get a clearer error message...

sduchesneau avatar Aug 26 '20 13:08 sduchesneau

Could you check with the dfuseeos tools against your trxdb to see if you are missing irreversibility data for that block height?

dfuseeos tools db blk 30352948 --dsn whatever://your-dsn-is/for/trxdb | jq '{"id": .id, "irreversible": .irreversible}'

sduchesneau avatar Aug 26 '20 14:08 sduchesneau