Failed to process blob sidecars request
I've seen this kind of issue twice in the past month; the latest happened earlier this morning. Teku will be running normally, then show IO failures in the logs (see below), retry for a while, and then apparently shut itself down.
My resolution in these cases has been to manually resync to rebuild the data. Restarting Teku has not worked.
The latest incident was on Teku version 25.7.1, running on Windows 10.
Today's log containing failures:
For future searchers, some excerpts:
beaconchain-async-13 | ERROR | BlobSidecarsByRangeMessageHandler | Failed to process blob sidecars request
org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: D:\teku\data\beacon\db\305259.sst: Could not create random access file.
CombinedStorageChannel-0 | ERROR | RetryingStorageUpdateChannel | Storage update failed, retrying.
FATAL | teku-status-log | Exiting due to fatal error in RetryingStorageUpdateChannel
Thanks for raising. Being unable to create a storage file for the database isn't great - is your disk close to full?
No, there is plenty of free space and no indication of disk faults in the hardware logs. Given that the workaround is to delete the database files and rebuild, and that this succeeds immediately, it would seem this is more likely some kind of bug in the DB layer that manifests rarely and randomly. Perhaps there is some means to catch this exception and log some extra diagnostics, at the very least?
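Something along these lines is what I had in mind: a minimal sketch, not Teku's actual code (the class name, method name, and probe-file idea are all made up), of catching the failure and recording some file-system state at the moment it happens:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Teku: wraps a storage update and, when the
// underlying database throws, logs file-system state that would help separate
// "disk/permissions problem at that moment" from "bug in the DB layer".
public final class StorageDiagnostics {

  public static void runWithDiagnostics(final Path dbDir, final Runnable storageUpdate) {
    try {
      storageUpdate.run();
    } catch (final RuntimeException e) {
      // However the native DBException reaches this layer, anything escaping a
      // Runnable is unchecked, so catching RuntimeException suffices for logging.
      System.err.printf("Storage update failed: %s%n", e);
      logFileSystemState(dbDir);
      throw e; // keep the existing retry/shutdown behaviour
    }
  }

  private static void logFileSystemState(final Path dbDir) {
    final File dir = dbDir.toFile();
    System.err.printf(
        "db dir=%s exists=%b writable=%b usableSpace=%d bytes%n",
        dir, dir.exists(), dir.canWrite(), dir.getUsableSpace());
    try {
      // Attempt the same kind of operation that failed: creating a new file
      // inside the database directory, then removing it again.
      final Path probe = Files.createTempFile(dbDir, "probe", ".tmp");
      Files.delete(probe);
      System.err.println("probe file create/delete in db dir succeeded");
    } catch (final IOException ioe) {
      System.err.printf("probe file create in db dir failed: %s%n", ioe);
    }
  }
}

The rethrow keeps the existing retry/shutdown behaviour; the only change would be the extra context in the log.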
The only time I've seen something like that, it has turned out to be memory-related (which seems super random, I know).
We do have a bunch of tests around the database infra, but we can take another dig and see if there's something in the stack trace.
If you do have some 'downtime', I'd definitely recommend running a memtest on the box for sanity if it does happen again, just because I've seen that specific set of events before, and it's relatively easy to run a check and confirm that the memory is functioning correctly...
Noting this just happened again, 26 days after the previous occurrence, with the same indications. I've not yet done a memory diagnostic; I will still aim to do that, even though I can't see the rationale for it given that the system is otherwise completely free of errors in the system logs.
If this memory test comes back clean, is there a path forward?
We can definitely try to replicate it; we've just not seen it ourselves. Another option could potentially be switching to RocksDB, which is a different database backend. There may be more options as well, depending on what we find.
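For context on what a different backend means at the library level, here is a rough sketch (illustration only, not how Teku selects or opens its own database; the class name and target path are made up) of exercising RocksDB directly through the standard rocksdbjni binding:

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Illustration only: a standalone smoke test against RocksDB via the standard
// rocksdbjni binding. The class name and target path are made up; this is not
// how Teku configures or opens its own database.
public final class RocksDbSmokeTest {

  public static void main(final String[] args) throws RocksDBException {
    RocksDB.loadLibrary(); // loads the bundled native library

    try (Options options = new Options().setCreateIfMissing(true);
        RocksDB db = RocksDB.open(options, "D:\\rocksdb-smoketest")) {
      // Minimal write/read round trip, which forces the native layer to create
      // and access files in the target directory.
      db.put("key".getBytes(), "value".getBytes());
      System.out.println("read back: " + new String(db.get("key".getBytes())));
    }
  }
}

If a round trip like this keeps working on the same disk while leveldbjni keeps failing, that would point more firmly at the LevelDB side.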