State Reconstruction reaches a faulty state

Open JustinZal opened this issue 3 years ago • 3 comments

Description

I started my Lighthouse client with the --reconstruct-historic-states flag and later restarted it. After the restart I got an error that did not stop the client but did stop the state reconstruction:

ERRO State reconstruction failed error: VectorChunkError(Missing { chunk_index: 357 }), service: beacon

Version

Using version 2.5.1-stable (non-portable).

Possible solution

It seems that the failure is observed in the following function:

// Chunks at the end index are included.
// TODO: could be more efficient with a real range query (perhaps RocksDB)
fn range_query<S: KeyValueStore<E>, E: EthSpec, T: Decode + Encode>(
    store: &S,
    column: DBColumn,
    start_index: usize,
    end_index: usize,
) -> Result<Vec<Chunk<T>>, Error> {
    let range = start_index..=end_index;
    let len = range
        .end()
        // Add one to account for inclusive range.
        .saturating_add(1)
        .saturating_sub(*range.start());
    let mut result = Vec::with_capacity(len);

    for chunk_index in range {
        let key = &chunk_key(chunk_index)[..];
        let chunk = Chunk::load(store, column, key)?.ok_or(ChunkError::Missing { chunk_index })?;
        result.push(chunk);
    }

    Ok(result)
}
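To make the failure mode concrete, here is a minimal, self-contained sketch of the same loop. It is not Lighthouse code: `MemoryStore`, the simplified `ChunkError`, and the `missing_chunks` helper are hypothetical stand-ins for illustration, and a `BTreeMap` takes the place of the real `KeyValueStore`. It only shows that a single absent chunk anywhere in the inclusive range aborts the whole query with `Missing { chunk_index }`.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for the error seen in the log.
#[derive(Debug, PartialEq)]
enum ChunkError {
    Missing { chunk_index: usize },
}

/// Hypothetical in-memory store: chunk index -> chunk bytes.
type MemoryStore = BTreeMap<usize, Vec<u8>>;

/// Same shape as `range_query` above: the end index is inclusive and
/// the first absent chunk aborts the whole query.
fn range_query(
    store: &MemoryStore,
    start_index: usize,
    end_index: usize,
) -> Result<Vec<Vec<u8>>, ChunkError> {
    let mut result = Vec::new();
    for chunk_index in start_index..=end_index {
        let chunk = store
            .get(&chunk_index)
            .cloned()
            .ok_or(ChunkError::Missing { chunk_index })?;
        result.push(chunk);
    }
    Ok(result)
}

/// Hypothetical diagnostic helper: list every missing index in the range
/// instead of stopping at the first one.
fn missing_chunks(store: &MemoryStore, start_index: usize, end_index: usize) -> Vec<usize> {
    (start_index..=end_index)
        .filter(|i| !store.contains_key(i))
        .collect()
}

fn main() {
    let mut store = MemoryStore::new();
    for i in 355..=359 {
        store.insert(i, vec![0u8; 4]);
    }
    // Simulate a chunk that was never written (or was lost across a restart).
    store.remove(&357);

    println!("{:?}", range_query(&store, 355, 359));
    println!("{:?}", missing_chunks(&store, 355, 359));
}
```

A scan like `missing_chunks` would at least tell us whether chunk 357 is the only gap or the first of several, which might help narrow down when the write was lost.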

JustinZal avatar Aug 05 '22 14:08 JustinZal

I think this is an instance of this bug https://github.com/sigp/lighthouse/issues/3011, which I've been trying to solve on and off for a while. The strangest thing about it is that it doesn't seem to happen consistently (and AFAICT there's no obvious flaw in the logic).

Recently @tthebst also had a go at fixing it and found a more reliable way to reproduce it. I need to dig into his changes but haven't had time yet.

Hopefully we'll have this fixed soon. In the meantime you could try running it again and seeing if you can get lucky.

Out of interest, how much free space was available on your disk when you encountered this? I have a suspicion that it's more likely when the disk is close to full (say, <40GB free).

michaelsproul avatar Aug 05 '22 22:08 michaelsproul

Thank you for your answer. Actually, I had over 600GB free on my SSD when this issue occurred. A few details I should probably have mentioned in the description:

  • The issue occurred on 2 of my nodes (on separate servers and separate networks) after I had restarted the node process many times while testing a few things: roughly 20-30 cycles of running the node for a while and then restarting it. Both nodes were running the same Lighthouse version, v2.5.1.
  • For now, I have backed up my current datadir (~150GB) and restarted the sync. I will let you know if anything changes.

ghost avatar Aug 05 '22 23:08 ghost

Added some debugging tools in this PR which might help us get an idea of what happened: https://github.com/sigp/lighthouse/pull/3511

michaelsproul avatar Aug 26 '22 07:08 michaelsproul