
Failed to process blob sidecars request

Open NoahStahl opened this issue 3 months ago • 5 comments

I've seen this kind of issue twice in the past month; the latest happened earlier this morning. Teku runs normally, then shows IO failures in the logs (see below), retries for a while, and then apparently shuts itself down.

My resolution in these cases has been to manually resync to rebuild the data. Restarting Teku has not worked.

Latest incident was on version 25.7.1. This is running on Windows 10.

Today's log containing failures:

teku-log - failure.txt

For future searchers, some excerpts:

beaconchain-async-13 | ERROR | BlobSidecarsByRangeMessageHandler | Failed to process blob sidecars request

org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: D:\teku\data\beacon\db\305259.sst: Could not create random access file.

CombinedStorageChannel-0 | ERROR | RetryingStorageUpdateChannel | Storage update failed, retrying.

FATAL | teku-status-log | Exiting due to fatal error in RetryingStorageUpdateChannel

NoahStahl avatar Sep 04 '25 13:09 NoahStahl

Thanks for raising. Being unable to create a storage file for the database isn't great - is your disk close to full?

rolfyone avatar Sep 07 '25 23:09 rolfyone

No, plenty of free space and no indication of disk faults in hardware logs. Given that the workaround is to delete the database files and rebuild, and that succeeds immediately, it would seem this is more likely some kind of bug in the DB layer that manifests rarely/randomly. Perhaps there is some means to catch this exception and log some extra diagnostics at the very least?
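To make the "extra diagnostics" idea concrete, here is a rough sketch of what could be captured when the IO error fires. It uses only standard `java.nio.file` calls; the class name `DbIoDiagnostics` and its `collect` method are hypothetical and are not part of Teku or leveldbjni. It assumes the path of the failing `.sst` file can be taken from the exception message or otherwise passed in.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical helper (not a Teku class): gathers filesystem facts about the
// file that a failed database operation referred to.
final class DbIoDiagnostics {

  private DbIoDiagnostics() {}

  static Map<String, String> collect(final Path failedFile) {
    final Map<String, String> info = new LinkedHashMap<>();
    info.put("file", failedFile.toString());
    info.put("fileExists", String.valueOf(Files.exists(failedFile)));

    // Inspect the containing directory; stop early if it is missing.
    final Path dir = failedFile.toAbsolutePath().getParent();
    if (dir == null || !Files.isDirectory(dir)) {
      info.put("dirExists", "false");
      return info;
    }
    info.put("dirExists", "true");
    info.put("dirWritable", String.valueOf(Files.isWritable(dir)));

    // Usable space on the volume that holds the database directory.
    try {
      final FileStore store = Files.getFileStore(dir);
      info.put("usableBytes", String.valueOf(store.getUsableSpace()));
    } catch (IOException e) {
      info.put("usableBytes", "unavailable: " + e.getMessage());
    }

    // Number of entries in the directory, in case a very large file count matters.
    try (Stream<Path> entries = Files.list(dir)) {
      info.put("entriesInDir", String.valueOf(entries.count()));
    } catch (IOException e) {
      info.put("entriesInDir", "unavailable: " + e.getMessage());
    }
    return info;
  }
}
```

Something like this could be called from a catch block around the failing database call and the resulting map written to the log next to the stack trace. It would not explain the failure by itself, but it would rule disk space, permissions, and a missing directory in or out at the moment of the error.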

NoahStahl avatar Sep 09 '25 14:09 NoahStahl

The only time I've seen something like that has turned out to be memory related (which seems super random, I know)

We do have a bunch of tests around the database infra, but we can take another dig and see if there's something in the stack trace.

If you do have some 'downtime', I'd definitely recommend running a memtest on the box for sanity if it does happen again, just because I've seen that specific set of events before, and it's relatively easy to run a check and see that it's functioning correctly...

rolfyone avatar Sep 10 '25 02:09 rolfyone

Noting this just happened again, 26 days after previous occurrence. Same indications. I've not done a memory diagnostic, will still aim to do that even though I can't see the rationale for it, given that the system is otherwise completely free of errors in system logs.

If this memory test comes back clean, is there a path forward?

NoahStahl avatar Oct 01 '25 04:10 NoahStahl

We can definitely try to replicate it; we've just not seen it. Another option would potentially be switching to RocksDB, which is just a different database version. There may be more options as well, depending on what we find.

rolfyone avatar Nov 10 '25 02:11 rolfyone