lighthouse
State Reconstruction reaches a faulty state
Description
I started my lighthouse client with the --reconstruct-historic-states flag and later restarted it. After the restart I got the following error, which did not stop the client but did stop the state reconstruction:
ERRO State reconstruction failed error: VectorChunkError(Missing { chunk_index: 357 }), service: beacon
Version
Using version 2.5.1-stable (non-portable).
Possible solution
The failure appears to originate in the following function:
// Chunks at the end index are included.
// TODO: could be more efficient with a real range query (perhaps RocksDB)
fn range_query<S: KeyValueStore<E>, E: EthSpec, T: Decode + Encode>(
    store: &S,
    column: DBColumn,
    start_index: usize,
    end_index: usize,
) -> Result<Vec<Chunk<T>>, Error> {
    let range = start_index..=end_index;
    let len = range
        .end()
        // Add one to account for inclusive range.
        .saturating_add(1)
        .saturating_sub(*range.start());
    let mut result = Vec::with_capacity(len);

    for chunk_index in range {
        let key = &chunk_key(chunk_index)[..];
        let chunk = Chunk::load(store, column, key)?.ok_or(ChunkError::Missing { chunk_index })?;
        result.push(chunk);
    }

    Ok(result)
}
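To illustrate how a single gap in the stored chunks produces the Missing { chunk_index } error above, here is a minimal, self-contained sketch of the same loop. It is not Lighthouse code: MemStore, the simplified ChunkError, and the missing_chunks helper are hypothetical stand-ins that replace the real KeyValueStore, Chunk::load, and chunk_key with a plain in-memory map.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the store's chunk error type.
#[derive(Debug, PartialEq)]
enum ChunkError {
    Missing { chunk_index: usize },
}

/// Hypothetical in-memory store: chunk index -> raw chunk bytes.
type MemStore = BTreeMap<usize, Vec<u8>>;

/// Load every chunk in `start_index..=end_index` (end inclusive),
/// returning an error at the first index that has no stored chunk.
fn range_query(
    store: &MemStore,
    start_index: usize,
    end_index: usize,
) -> Result<Vec<Vec<u8>>, ChunkError> {
    // Add one to account for the inclusive range, without overflowing.
    let len = end_index.saturating_add(1).saturating_sub(start_index);
    let mut result = Vec::with_capacity(len);

    for chunk_index in start_index..=end_index {
        let chunk = store
            .get(&chunk_index)
            .cloned()
            .ok_or(ChunkError::Missing { chunk_index })?;
        result.push(chunk);
    }

    Ok(result)
}

/// Diagnostic helper (an assumption, not part of the original function):
/// report every missing index in the range instead of only the first.
fn missing_chunks(store: &MemStore, start_index: usize, end_index: usize) -> Vec<usize> {
    (start_index..=end_index)
        .filter(|index| !store.contains_key(index))
        .collect()
}
```

With chunks 355 through 358 stored, range_query(&store, 355, 358) succeeds; after removing chunk 357 it returns Err(ChunkError::Missing { chunk_index: 357 }), which matches the shape of the logged error. A helper like missing_chunks could help tell a single lost chunk apart from a larger hole in the column.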
I think this is an instance of this bug https://github.com/sigp/lighthouse/issues/3011, which I've been trying to solve on and off for a while. The strangest thing about it is that it doesn't seem to happen consistently (and AFAICT there's no obvious flaw in the logic).
Recently @tthebst also had a go at fixing it and found a more reliable way to reproduce it. I need to go and dig into his changes but haven't had time yet.
Hopefully we'll have this fixed soon. In the meantime you could try running it again and seeing if you can get lucky.
Out of interest, how much free space was available on your disk when you encountered this? I have a suspicion that it's more likely when the disk is close to full (like <40GB free).
Thank you for your answer. Actually, I had over 600GB free on my SSD when this issue occurred. A few details I should probably have mentioned in the description:
- The issue occurred on 2 of my nodes (on separate servers and separate networks) after I had restarted the node process many times (I was testing a few things): roughly 20-30 cycles of running the node for a while and then restarting it. Both nodes were running the identical lighthouse version, v2.5.1.
- For now, I backed up my current datadir (~150GB) and restarted the sync. I will let you know if anything changes.
Added some debugging tools in this PR which might help us get an idea of what happened: https://github.com/sigp/lighthouse/pull/3511