
ship stopped writing new blocks

Open cc32d9 opened this issue 5 years ago • 19 comments

eosio-2.0.7, Ubuntu 18.04. The node has resynced Telos from genesis. All of a sudden it stopped writing state history logs, with no messages related to the moment when it stopped (log level is set to error).

config.ini (producer_api_plugin was added several hours after this happened):

chain-state-db-size-mb = 65536
reversible-blocks-db-size-mb = 2048

wasm-runtime = eos-vm-jit
eos-vm-oc-enable = true

p2p-accept-transactions = false
api-accept-transactions = false

http-server-address = 0.0.0.0:8802
p2p-listen-endpoint = 0.0.0.0:9802
p2p-server-address = :9857

access-control-allow-origin = *

#verbose-http-errors = true
contracts-console = true

max-clients = 100

plugin = eosio::chain_plugin
plugin = eosio::chain_api_plugin
plugin = eosio::db_size_api_plugin

plugin = eosio::producer_api_plugin

plugin = eosio::state_history_plugin
trace-history = true
chain-state-history = true
trace-history-debug-mode = true
state-history-endpoint = 0.0.0.0:8082

sync-fetch-span = 10

p2p-peer-address = p2p.telosarabia.net:9876
p2p-peer-address = telos.cryptolions.io:9871
p2p-peer-address = telos.eosdublin.io:9876
root@eosio:/srv/telos# ls -al data/state-history/
total 291419499
drwxr-xr-x 2 root root            6 Sep  6 13:31 .
drwxr-xr-x 6 root root            6 Sep  6 13:31 ..
-rw-r--r-- 1 root root    831008728 Sep  9 08:12 chain_state_history.index
-rw-r--r-- 1 root root 117728914750 Sep  9 08:12 chain_state_history.log
-rw-r--r-- 1 root root    831008728 Sep  9 08:12 trace_history.index
-rw-r--r-- 1 root root 178821446038 Sep  9 08:12 trace_history.log
root@eosio:/srv/telos# ls -al data/blocks/
total 47089027
drwxr-xr-x 3 root root            5 Sep  6 13:31 .
drwxr-xr-x 6 root root            6 Sep  6 13:31 ..
-rw-r--r-- 1 root root    877516224 Sep 10 07:29 blocks.index
-rw-r--r-- 1 root root 128877141011 Sep 10 07:29 blocks.log
drwxr-xr-x 2 root root            3 Sep  6 13:31 reversible

After restarting nodeos, it prints the block range present in state history, and doesn't collect any new data. The websocket is responding, but obviously not delivering any traces.
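
A quick way to confirm the endpoint itself is alive is to open the websocket and read the first message, since the state_history_plugin sends its protocol ABI right after the handshake. A minimal sketch, assuming the third-party Python websockets package and the state-history-endpoint from the config above (a real delivery check would additionally need an ABI-encoded get_blocks_request_v0):

# connect to the SHiP endpoint and print the size of the first message (the protocol ABI)
import asyncio
import websockets

SHIP_URL = "ws://127.0.0.1:8082"   # state-history-endpoint from config.ini above

async def main():
    async with websockets.connect(SHIP_URL, max_size=None) as ws:
        abi = await ws.recv()       # first frame after the handshake is the ABI (JSON text)
        print("SHiP answered, ABI length:", len(abi))

asyncio.run(main())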

In the attached log, I restarted nodeos with default logging at Sep 10 07:32:54. It started printing "missed a block in trace_history.log".

telos.nodeos.log

cc32d9 avatar Sep 10 '20 08:09 cc32d9

The node has gone on with the head block, so it's difficult to reproduce or test a bugfix. But what it should have done is stop, not continue, after there was a failure.

cc32d9 avatar Sep 10 '20 08:09 cc32d9

Have you been able to do this successfully before? Just trying to determine if something may have changed recently that caused this.

aclark-b1 avatar Sep 16 '20 17:09 aclark-b1

No, it's the first time that I see the state history data corrupted. But what is a systematic problem is that nodeos prints a warning and goes ahead, instead of aborting when the data is corrupted.

cc32d9 avatar Sep 16 '20 17:09 cc32d9

Just experienced it with 2.0.6 on WAX on two independent servers within a couple of days. It just stops updating state history silently. The network has been experiencing frequent forks. This behavior is quite random: I'm maintaining 4 ship servers on WAX, and only one falls out at a time.

cc32d9 avatar Feb 22 '21 22:02 cc32d9

It looks like it's related to the increased fork frequency, so #10080 might be related.

cc32d9 avatar Feb 22 '21 22:02 cc32d9

We have seen this several times on EOS recently where we get this warning with every block:

missed a block in chain_state_history.log
    {"name":"chain_state_history"}
    nodeos  state_history_log.hpp:92 write_entry

And we have to stop the node and restore from backup to fix it. This is on 2.0.9.
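
One way to read why the warning then repeats for every block, and why only a restore helps: the log apparently accepts only the next contiguous block number, so once a single block fails to be written, every later block is ahead of the expected position and is skipped too. A toy Python model of that behaviour; the names and the check are assumptions drawn from the warning text, not the actual state_history_log.hpp code:

class ShipLog:
    def __init__(self):
        self.end_block = 0                  # next block number the log expects

    def write_entry(self, block_num):
        # hypothetical check modelling the "missed a block" warning
        if self.end_block and block_num > self.end_block:
            print(f"missed a block in chain_state_history.log "
                  f"(expected {self.end_block}, got {block_num})")
            return                          # entry dropped, end_block never advances
        self.end_block = block_num + 1

log = ShipLog()
for n in (1, 2, 4, 5, 6):                   # block 3 never makes it into the log
    log.write_entry(n)                      # warns for 4, 5, 6 -- and for everything after

Under that model the gap never closes on its own, which matches the need to restore the files from a backup taken before the gap.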

n8d avatar Feb 22 '21 22:02 n8d

in my case, logging was limited to error level, so no logs here

cc32d9 avatar Feb 22 '21 22:02 cc32d9

We have seen this several times on EOS recently where we get this warning with every block:

Do you have a full log available? Interested in whether there was a warn/error before this started warning on every block. Once in this state it really can't recover without a restore from backup.

heifner avatar Feb 23 '21 17:02 heifner

in my case, logging was limited to error level, so no logs here

Would recommend you run with at least warn level logging.
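
For reference, a minimal logging.json along those lines; this is a sketch of the usual nodeos/fc logging config (picked up as logging.json from the config directory), not a verified template:

{
  "appenders": [
    { "name": "stderr", "type": "console", "args": { "stream": "std_error" }, "enabled": true }
  ],
  "loggers": [
    { "name": "default", "level": "warn", "enabled": true, "additivity": false, "appenders": [ "stderr" ] }
  ]
}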

heifner avatar Feb 23 '21 17:02 heifner

We have seen this several times on EOS recently where we get this warning with every block:

Do you have a full log available? Interested in whether there was a warn/error before this started warning on every block. Once in this state it really can't recover without a restore from backup.

Looking closer, it actually happened with a restart (we do auto stop + zfs snap + start). On stop it put this in the log:

corrupted size vs. prev_size

And then when it started it immediately had the missed block in state history issue.

n8d avatar Feb 23 '21 22:02 n8d

see "crash on exit with error "corrupted size vs. prev_size" https://github.com/EOSIO/eos/issues/8450

matthewdarwin avatar Feb 23 '21 22:02 matthewdarwin

see "crash on exit with error "corrupted size vs. prev_size" #8450

Yep, I thought that one was fixed long ago, but it doesn't look like it, or maybe it's a recent regression.

n8d avatar Feb 23 '21 22:02 n8d

Whatever the cause is, nodeos should stop if it cannot keep writing the ship data. Currently this failure is difficult to detect because it keeps sending empty ship data via websocket.
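
Until nodeos itself aborts on this, an external check is one workaround. A minimal sketch (assuming the http-server-address from the config above and the chain_state_history.log path shown earlier) that alerts when the chain head keeps advancing while the SHiP log file stops growing:

# poll the chain head and the SHiP log size; alert if the head moves but the log doesn't
import json
import os
import time
import urllib.request

API_URL  = "http://127.0.0.1:8802/v1/chain/get_info"                  # http-server-address above
SHIP_LOG = "/srv/telos/data/state-history/chain_state_history.log"    # path from the listing above
INTERVAL = 60                                                          # seconds between samples

def head_block():
    with urllib.request.urlopen(API_URL, data=b"{}") as resp:
        return json.load(resp)["head_block_num"]

prev_head, prev_size = head_block(), os.path.getsize(SHIP_LOG)
while True:
    time.sleep(INTERVAL)
    head, size = head_block(), os.path.getsize(SHIP_LOG)
    if head > prev_head and size == prev_size:
        print(f"ALERT: head is {head} but {SHIP_LOG} has not grown")
    prev_head, prev_size = head, size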

cc32d9 avatar Feb 24 '21 22:02 cc32d9

Had a WAX SHiP node start missing blocks in the SHiP logs, and saw the following just before it started missing blocks:

fc: exception: 8 out_of_range_exception: Out of Range write datastream of length 11327 over by -51 {"method":"write","len":"11327","over":"-51"} nodeos datastream.cpp:6 throw_datastream_range_error

eosusa avatar Feb 25 '21 16:02 eosusa

Had this error occur again a few days ago, and that was the only node where I forgot to set read-mode = head. It was isolated from speculative transactions though:

p2p-accept-transactions = false
api-accept-transactions = false

Could also be related to https://github.com/EOSIO/eos/issues/9972

cc32d9 avatar Sep 14 '21 14:09 cc32d9

There's at least one occurrence of this error with read-mode=head: 2.0.13 (WAX) could not write to state history and stopped.

cc32d9 avatar Oct 21 '21 17:10 cc32d9

One more occasion, and this time it's not at the head block: I was resyncing a Telos ship node from genesis, and it stopped about halfway through the history (last block in blocks log: 103875765; the current head block is above 180M).

cc32d9 avatar Nov 17 '21 08:11 cc32d9

Attached is the log from the Telos node (log level is set to error). The node started from near genesis on Nov 12, then stopped on Nov 14 12:06:20. Later on, I tried to start it again on Nov 17. telos_ship.log

cc32d9 avatar Nov 17 '21 10:11 cc32d9

The error is quite definitely related to how busy the storage holding the ship files is. It seems like the write operation times out too early.
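
One way to check that theory is to watch write latency on the device under data/state-history while the node is syncing, e.g. with iostat from the sysstat package (w_await is the average write latency in ms, %util how saturated the device is):

iostat -x 10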

cc32d9 avatar Dec 03 '21 11:12 cc32d9