neo-go
neo-go copied to clipboard
Restarted W10 node sometimes becomes stuck adding new block (onPersist insufficient funds)
Current Behavior
Attempting to start an out-of-sync node produces:
2023-12-19T15:57:41.356Z WARN blockQueue: failed adding block into the blockchain {"error": "onPersist failed: VM has failed: at instruction 0 (SYSCALL): insufficient funds", "blockHeight": 4578499, "nextIndex": 4578500}
Expected Behavior
The node should successfully sync the next block.
Possible Solution
Something related to https://github.com/nspcc-dev/neo-go/pull/3110 maybe? Not looked too deep into it.
Steps to Reproduce
- Fully or partially sync a node then kill it.
- Restart the node. If it progresses, kill it again.
- Repeat 2 until the error occurs.
Context
Ran into this issue with my indexer node. By default the node is not kept alive, it is fully synchronized then shut down once all operations are complete. Occasionally, but not always, starting the node again produces this error. If it does, no further progress can be made without a full resync.
Your Environment
Windows 10
> ./neogo -v
NeoGo
Version: 0.104.0
GoVersion: go1.21.4
Are you using Level or Bolt?
LevelDB.
That's more interesting. It obviously means a broken DB, but the way it's handled internally it's not trivial for it to get into this state. Yet if it happens, there should be a reason for that.
CC @532910.
Uncovered some memory issues on my machine in recent days. Could well be a hardware fault that is responsible, so if you can't replicate, don't spend too long on it.
I was able to reproduce a similar bug on my local machine under the following conditions:
- Low disk space left (< 100 Mb), LevelDB is used (BoltDB can't function at all in such restricted conditions).
- DB resync from trusted dump is started, but the process was killed by OOM killer.
- Free a bit of disk space.
- DB resync from trusted dump is resumed and in a while I got:
2024-01-12T11:46:43.686+0300 WARN contract invocation failed {"tx": "591473831fa9508870afb22d89e595edb40306ff49bc2373d6b9bbc87abbda0e", "block": 2913226, "error": "at instruction 1 (SYSCALL): method 'Callback' not found"}
2024-01-12T11:46:43.700+0300 WARN contract invocation failed {"tx": "40d7d6be4922724174072b7c60f0cce175a0d470a23235b02e584fafcb9d0ec7", "block": 2913281, "error": "at instruction 1 (SYSCALL): method 'Callback' not found"}
2024-01-12T11:46:44.011+0300 INFO shutting down service {"service": "Prometheus", "endpoint": "[::]:2112"}
2024-01-12T11:47:21.743+0300 INFO persisted to disk {"blocks": 5852, "keys": 118232, "headerHeight": 2876784, "blockHeight": 2876783, "took": "43.997511933s"}
2024-01-12T11:47:36.029+0300 INFO persisted to disk {"blocks": 37886, "keys": 774298, "headerHeight": 2914670, "blockHeight": 2914669, "took": "14.285445886s"}
failed to add block 2914670: onPersist failed: VM has failed: at instruction 0 (SYSCALL): insufficient funds
I also was able to reproduce the same problem during network syncing with low disk space, so from my side this problem also seems to be connected with hardware problems.
Do you have the DB still? Can we check what's the state difference? It shouldn't happen this way, whatever written must be consistent, it's a single transaction after all (or the DB itself is broken if it can't ensure consistency on commit failure because of limited disk space).
I don't, I've removed the DB to test the release compatibility, but for some reason I suspect that the problem can be reproduced. I'll try to reproduce it and save the DB to look at the diffs.
I have reproduced the same problem with boltdb, but can't share the database, unfortunately. My plan is to look into all the keys we have and look at the diff between corrupted db and freshly restored one.
@roman-khimov @AnnaShaleva Do you have any specific place in mind to look into?
Oh, I was too soon to write, my problem seems a bit different:
2024-04-05T14:33:03.605Z WARN blockQueue: failed adding block into the blockchain {"error": "transaction 66385f00f74653c40ff2b0aaa07b394670abf853977ac309755a5c86d4e36aca failed to verify: insufficient funds", "blockHeight": 36734, "nextIndex": 36735}
But there are other nodes which are OK, so it seems an application/db issue, likely related to this one.
Zero ideas on this one. It'd be cool if you'll be able to at least do the diff between proper state and whatever is in this broken DB. Luckily, your chain doesn't seem to be long.
My version of neo-go is 0.104 with some commits after (base at 441eb8aa86be9) with fixes from #3279 (we reverted and reapplied #3110, but there were no conflicts so I do not expect problems here) and some unrelated improvements in the network server (mostly for multi-IP and TLS)
- I have a supposedly corruppted DB which contains blocks up to 36734 and some headers after.
- From this node I have exported chain dump.
- I restored dump up to 36734 block to get a fresh database.
- Dumped all they keys in hex to a file and compared diff:
Sanity check: values at 0xc1 are different as corrupted node has acquired some headers, values at 0xc0 are the same.
That being said here are seemingly relevant diffs stored by 0x70 prefix (contract data):
Native neo (0xfbffffff == -5
)
0x17 = 23 = prefixVoterRewardPerCommittee
< 70fbffffff0e: 400841022821020c54c4a3f309e5f2e388950dae194d12802704c789923064ca71407197cce0ff21004102282102731f3a03f1cb00e42abc39babbbd85289f8964352a201bea483b945081cac09021004102282102780ab53cd434ae48657b609351c595bfee165e2e148d37821d1793c85336900c2100410228210320a445bde3e92223695e9131fc60d0accfe274facd0723ff5c74421dc3ec18c9210041022821034f6caac07bfd7c4f3522a7c9aa787ad72ec073fceee87901ca6aaf692467e23c21004102282103ba68d87e378154bda1eef17276d594cb5335c57eba927b7ec14289d17ab1908221004102282103d05b5b040baced5e0006485c8565cd3eb69754faa0ea6c694f1099feb608f2b721004102282103e774485cb0a46389bfaf565937f54e31dbbb5bcff5cb9a5f220e6dd5445b16372100
< 70fbffffff1410ee5df183cc0f45e9beda75964628ee0471f9f8: 4104210420bcbe002103778f002821034f6caac07bfd7c4f3522a7c9aa787ad72ec073fceee87901ca6aaf692467e23c21060080d8ce6905
< 70fbffffff143a017ecb1a4f0afd95e28a856109c006db91a2e2: 4104210420bcbe0021037e8f002821020c54c4a3f309e5f2e388950dae194d12802704c789923064ca71407197cce0ff21060080d8ce6905
< 70fbffffff145061b45aa4b5fa5293310a5300f0968099f4d145: 4104210420bcbe0021037d8f00282102731f3a03f1cb00e42abc39babbbd85289f8964352a201bea483b945081cac09021060080d8ce6905
< 70fbffffff145a296b9b7035d8c1dc74ef80bb7ea35be8decfa0: 4104210420bcbe002103598f00282102780ab53cd434ae48657b609351c595bfee165e2e148d37821d1793c85336900c21060080d8ce6905
< 70fbffffff145fd4c6252239f6d810987dd4344ed8fe6cf2a5f8: 4104210420bcbe0021037b8f0028210320a445bde3e92223695e9131fc60d0accfe274facd0723ff5c74421dc3ec18c921060080d8ce6905
< 70fbffffff1485d8114bf1c98b3cb0511d335e95bbc5577d241b: 4104210420bcbe0021036f8f00282103d05b5b040baced5e0006485c8565cd3eb69754faa0ea6c694f1099feb608f2b721060080d8ce6905
< 70fbffffff14d6271d847b9f5c752b774f783c3ec540b87b0856: 4104210420bcbe0021036a8f00282103e774485cb0a46389bfaf565937f54e31dbbb5bcff5cb9a5f220e6dd5445b163721060080d8ce6905
< 70fbffffff14f33a2a2297f7f25642289180eb8f015ab9f8458a: 4104210420bcbe002103398f00282103ba68d87e378154bda1eef17276d594cb5335c57eba927b7ec14289d17ab1908221060080d8ce6905
< 70fbffffff17020c54c4a3f309e5f2e388950dae194d12802704c789923064ca71407197cce0ff: 0080d8ce6905
< 70fbffffff1702731f3a03f1cb00e42abc39babbbd85289f8964352a201bea483b945081cac090: 0080d8ce6905
< 70fbffffff1702780ab53cd434ae48657b609351c595bfee165e2e148d37821d1793c85336900c: 0080d8ce6905
< 70fbffffff170320a445bde3e92223695e9131fc60d0accfe274facd0723ff5c74421dc3ec18c9: 0080d8ce6905
< 70fbffffff17034f6caac07bfd7c4f3522a7c9aa787ad72ec073fceee87901ca6aaf692467e23c: 0080d8ce6905
< 70fbffffff1703ba68d87e378154bda1eef17276d594cb5335c57eba927b7ec14289d17ab19082: 0080d8ce6905
< 70fbffffff1703d05b5b040baced5e0006485c8565cd3eb69754faa0ea6c694f1099feb608f2b7: 0080d8ce6905
< 70fbffffff1703e774485cb0a46389bfaf565937f54e31dbbb5bcff5cb9a5f220e6dd5445b1637: 0080d8ce6905
---
> 70fbffffff0e: 400841022821020c54c4a3f309e5f2e388950dae194d12802704c789923064ca71407197cce0ff210420bcbe00410228210320a445bde3e92223695e9131fc60d0accfe274facd0723ff5c74421dc3ec18c9210420bcbe0041022821034f6caac07bfd7c4f3522a7c9aa787ad72ec073fceee87901ca6aaf692467e23c210420bcbe004102282102731f3a03f1cb00e42abc39babbbd85289f8964352a201bea483b945081cac090210420bcbe004102282102780ab53cd434ae48657b609351c595bfee165e2e148d37821d1793c85336900c210420bcbe004102282103ba68d87e378154bda1eef17276d594cb5335c57eba927b7ec14289d17ab19082210420bcbe004102282103d05b5b040baced5e0006485c8565cd3eb69754faa0ea6c694f1099feb608f2b7210420bcbe004102282103e774485cb0a46389bfaf565937f54e31dbbb5bcff5cb9a5f220e6dd5445b1637210420bcbe00
> 70fbffffff1410ee5df183cc0f45e9beda75964628ee0471f9f8: 4104210420bcbe002103778f002821034f6caac07bfd7c4f3522a7c9aa787ad72ec073fceee87901ca6aaf692467e23c210600205917580d
> 70fbffffff143a017ecb1a4f0afd95e28a856109c006db91a2e2: 4104210420bcbe0021037e8f002821020c54c4a3f309e5f2e388950dae194d12802704c789923064ca71407197cce0ff2106004015d6580d
> 70fbffffff145061b45aa4b5fa5293310a5300f0968099f4d145: 4104210420bcbe0021037d8f00282102731f3a03f1cb00e42abc39babbbd85289f8964352a201bea483b945081cac0902106004015d6580d
> 70fbffffff145a296b9b7035d8c1dc74ef80bb7ea35be8decfa0: 4104210420bcbe002103598f00282102780ab53cd434ae48657b609351c595bfee165e2e148d37821d1793c85336900c210600c024db550d
> 70fbffffff145fd4c6252239f6d810987dd4344ed8fe6cf2a5f8: 4104210420bcbe0021037b8f0028210320a445bde3e92223695e9131fc60d0accfe274facd0723ff5c74421dc3ec18c92106004015d6580d
> 70fbffffff1485d8114bf1c98b3cb0511d335e95bbc5577d241b: 4104210420bcbe0021036f8f00282103d05b5b040baced5e0006485c8565cd3eb69754faa0ea6c694f1099feb608f2b7210600009d58570d
> 70fbffffff14d6271d847b9f5c752b774f783c3ec540b87b0856: 4104210420bcbe0021036a8f00282103e774485cb0a46389bfaf565937f54e31dbbb5bcff5cb9a5f220e6dd5445b1637210600009d58570d
> 70fbffffff14f33a2a2297f7f25642289180eb8f015ab9f8458a: 4104210420bcbe002103398f00282103ba68d87e378154bda1eef17276d594cb5335c57eba927b7ec14289d17ab190822106004034e0520d
> 70fbffffff17020c54c4a3f309e5f2e388950dae194d12802704c789923064ca71407197cce0ff: 004015d6580d
> 70fbffffff1702731f3a03f1cb00e42abc39babbbd85289f8964352a201bea483b945081cac090: 004015d6580d
> 70fbffffff1702780ab53cd434ae48657b609351c595bfee165e2e148d37821d1793c85336900c: 004015d6580d
> 70fbffffff170320a445bde3e92223695e9131fc60d0accfe274facd0723ff5c74421dc3ec18c9: 004015d6580d
> 70fbffffff17034f6caac07bfd7c4f3522a7c9aa787ad72ec073fceee87901ca6aaf692467e23c: 004015d6580d
> 70fbffffff1703ba68d87e378154bda1eef17276d594cb5335c57eba927b7ec14289d17ab19082: 004015d6580d
> 70fbffffff1703d05b5b040baced5e0006485c8565cd3eb69754faa0ea6c694f1099feb608f2b7: 004015d6580d
> 70fbffffff1703e774485cb0a46389bfaf565937f54e31dbbb5bcff5cb9a5f220e6dd5445b1637: 004015d6580d
Native notary (0xf6ffffff == -10
):
< 70f6ffffff013e93a75eaa8181d369821e4954e179f6a1731d82: 41022105973a799b072105ffffffff00
---
> 70f6ffffff013e93a75eaa8181d369821e4954e179f6a1731d82: 41022105df21f8e4082105ffffffff00
2657172c2757143
< 70f6ffffff01829cb414d5790e699a889db54a0d6b5d29bf2f08: 410221066b73b712e6002105ffffffff00
---
> 70f6ffffff01829cb414d5790e699a889db54a0d6b5d29bf2f08: 410221061820a8850c012105ffffffff00
2657198c2757169
< 70f6ffffff01ac77a8afa4216f24b548e236428d03024bca86ab: 41022106ff1db1f4e4002105ffffffff00
---
> 70f6ffffff01ac77a8afa4216f24b548e236428d03024bca86ab: 4102210630c082270b012105ffffffff00
2657212c2757183
< 70f6ffffff01c0cf46845aee71fcc27fd062253d547a83be83a3: 41022106a342145ae5002105ffffffff00
---
> 70f6ffffff01c0cf46845aee71fcc27fd062253d547a83be83a3: 41022106ada5bc9f0b012105ffffffff00
2657219c2757190
< 70f6ffffff01c857e2afbf2d669ede44ebf1a0ce022c54497c04: 41022105654b581c092105ffffffff00
---
> 70f6ffffff01c857e2afbf2d669ede44ebf1a0ce022c54497c04: 410221057f25f0a20a2105ffffffff00
2657224c2757195
< 70f6ffffff01d8da4f4721649e2a8144dbf183e8b3e030c8e6cd: 410221063c8a0135e5002105ffffffff00
---
> 70f6ffffff01d8da4f4721649e2a8144dbf183e8b3e030c8e6cd: 410221069aed55720b012105ffffffff00
2657235c2757206
< 70f6ffffff01e43c77ed610894ee38215cd9fd56eb2166dca74c: 41022106793bdc12e6002105ffffffff00
---
> 70f6ffffff01e43c77ed610894ee38215cd9fd56eb2166dca74c: 4102210605d7bf790c012105ffffffff00
There is also diff for GAS contract, it is a bit bigger, won't attach here.
Script I used to produce files:
func TestExport(t *testing.T) {
runExport(t, "./mainnet.bolt")
runExport(t, "./corrupted.mainnet.bolt")
}
func runExport(t *testing.T, path string) {
db, err := bbolt.Open(path, os.ModePerm, &bbolt.Options{
Timeout: time.Second,
ReadOnly: true,
})
require.NoError(t, err)
defer db.Close()
f, err := os.OpenFile(path+".keys", os.O_CREATE|os.O_TRUNC|os.O_WRONLY, os.ModePerm)
require.NoError(t, err)
defer f.Close()
require.NoError(t, db.View(func(tx *bbolt.Tx) error {
printBucket(f, tx.Bucket([]byte("DB")), "")
return nil
}))
}
func printBucket(f *os.File, b *bbolt.Bucket, indent string) {
c := b.Cursor()
for k, v := c.First(); k != nil; k, v = c.Next() {
if v != nil {
f.WriteString(indent)
f.WriteString(hex.EncodeToString(k))
f.WriteString(": ")
f.WriteString(hex.EncodeToString(v))
f.WriteString("\n")
} else {
printBucket(f, b.Bucket(k), indent+hex.EncodeToString(k)+" : ")
}
}
}
@EdgeDLT, @fyfyrchik, please take a look at the https://github.com/nspcc-dev/neo-go/pull/3443. It should fix the problem, but it needs to be confirmed, thus I'm not closing this issue.
@EdgeDLT, @fyfyrchik, 0.106.0 is out, you may try it. Please, write back if succeeded.