Bug: Migration step 06 fails on incomplete stamp metadata, causing total data loss

Open crtahlin opened this issue 5 months ago • 2 comments

Summary

Migration step 06 can fail during upgrade to 2.6.0+ when any chunk lacks complete postage stamp metadata, leaving the database in an inconsistent state where all chunks become unreachable despite remaining in storage.

Expected Behavior

  • Migration should handle missing or incomplete stamp metadata gracefully
  • Failed migration should not leave database in broken state
  • Users should be warned about potential data issues before migration runs
  • Rollback mechanism should exist for failed migrations

Current Behavior

Migration fails with a recommendation to "nuke" the database when either of the following occurs:

  1. Any BatchRadiusItemV1 chunk cannot compute stampHash
  2. chunkstamp.LoadWithBatchID() fails for any chunk

After the failure:

  1. Database is left with deleted old indexes but no new indexes created
  2. All data becomes unreachable (both pinned and regular chunks)

Root Cause

Migration step 06 (pkg/storer/migration/step_06.go:112-118) takes an all-or-nothing approach, aborting the whole migration on the first stamp lookup error:

  stamp, err := chunkstamp.LoadWithBatchID(idxStore, "reserve", batchRadiusItemV1.Address, batchRadiusItemV1.BatchID)
  if err != nil {
      return err  // MIGRATION FAILS - database inconsistent
  }
  // Old item already deleted at line 122, new item creation fails

The migration deletes old schema items before ensuring new schema items can be created successfully.
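To illustrate the ordering problem, here is a minimal sketch of a per-item migration that looks up the stamp and writes the new item before it deletes the old one. Everything in it (the `store`, `oldItem` and `newItem` types, `loadStampHash`, `migrateItem`) is a hypothetical in-memory stand-in, not bee's actual storage API or the real step 06 code.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the old and new index items; not bee's real types.
type oldItem struct{ Address, BatchID string }
type newItem struct{ Address, BatchID, StampHash string }

// store is a toy in-memory index store used only for this sketch.
type store struct {
	oldIdx map[string]oldItem // old-schema entries, keyed by chunk address
	newIdx map[string]newItem // new-schema entries, keyed by chunk address
	stamps map[string]string  // chunk address -> stamp hash (may be absent)
}

var errStampNotFound = errors.New("stamp not found")

// loadStampHash plays the role of the stamp lookup that can fail during migration.
func (s *store) loadStampHash(addr string) (string, error) {
	h, ok := s.stamps[addr]
	if !ok {
		return "", errStampNotFound
	}
	return h, nil
}

// migrateItem writes the new item first and deletes the old item last,
// so a failed lookup leaves the old index entry untouched.
func (s *store) migrateItem(it oldItem) error {
	hash, err := s.loadStampHash(it.Address)
	if err != nil {
		return fmt.Errorf("chunk %s: %w", it.Address, err)
	}
	s.newIdx[it.Address] = newItem{Address: it.Address, BatchID: it.BatchID, StampHash: hash}
	delete(s.oldIdx, it.Address)
	return nil
}

func main() {
	s := &store{
		oldIdx: map[string]oldItem{
			"addr1": {Address: "addr1", BatchID: "batch1"},
			"addr2": {Address: "addr2", BatchID: "batch1"},
		},
		newIdx: map[string]newItem{},
		stamps: map[string]string{"addr1": "hash1"}, // addr2 has no stamp
	}
	for _, addr := range []string{"addr1", "addr2"} {
		if err := s.migrateItem(s.oldIdx[addr]); err != nil {
			fmt.Println("not migrated:", err)
		}
	}
	fmt.Println("old entries left:", len(s.oldIdx), "new entries:", len(s.newIdx))
}
```

With this ordering, a chunk whose stamp cannot be loaded keeps its old index entry, so it stays reachable under the old schema instead of disappearing from both indexes.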

Steps to Reproduce

  1. Run bee node with incomplete stamp metadata (common in older versions)
  2. Upgrade to 2.6.0+
  3. Migration step 06 fails on first chunk with missing stamp data
  4. Observe error: "It's recommended that the nuke cmd is run to reset the node"
  5. Verify chunks exist in storage but cannot be retrieved via any API

Impact

  • Critical: Total data loss requiring database reset
  • Affects: Nodes upgrading from pre-2.6.0 versions
  • User experience: Forced to choose between upgrade and data preservation

Proposed Solutions

Option 1: Graceful Degradation

  • Skip chunks with missing stamp metadata during migration
  • Log warnings for problematic chunks
  • Allow migration to complete with partial data

Option 2: Two-Phase Migration

  • Phase 1: Validate all chunks can be migrated successfully
  • Phase 2: Only proceed with atomic migration if Phase 1 succeeds
  • Provide clear error reporting on validation failures
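Option 2 could be sketched roughly as below: a read-only validation pass collects every chunk that cannot be migrated, and the write pass only runs if that list is empty. The `item` type and the `validate` and `migrateTwoPhase` functions are hypothetical, and the stamp lookup is reduced to a map for brevity.

```go
package main

import "fmt"

// item is a hypothetical stand-in for an old-schema index entry.
type item struct{ Address, BatchID string }

// validate is phase 1: it only reads, and returns the addresses of all
// chunks that have no stamp hash available.
func validate(items []item, stamps map[string]string) []string {
	var bad []string
	for _, it := range items {
		if _, ok := stamps[it.Address]; !ok {
			bad = append(bad, it.Address)
		}
	}
	return bad
}

// migrateTwoPhase runs phase 2 (apply) only when phase 1 found no problems.
func migrateTwoPhase(items []item, stamps map[string]string, apply func(item, string) error) error {
	if bad := validate(items, stamps); len(bad) > 0 {
		return fmt.Errorf("validation failed for %d of %d chunks, nothing was modified: %v",
			len(bad), len(items), bad)
	}
	for _, it := range items {
		if err := apply(it, stamps[it.Address]); err != nil {
			return fmt.Errorf("chunk %s: %w", it.Address, err)
		}
	}
	return nil
}

func main() {
	items := []item{{"addr1", "batch1"}, {"addr2", "batch1"}}
	stamps := map[string]string{"addr1": "hash1"} // addr2 is missing
	writes := 0
	err := migrateTwoPhase(items, stamps, func(item, string) error {
		writes++
		return nil
	})
	fmt.Println("writes:", writes, "err:", err)
}
```

The validation error lists every failing chunk at once rather than only the first, which gives the operator something actionable before any index is touched.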

Option 3: Stamp Reconstruction

  • Attempt to reconstruct missing stamp metadata from available data
  • Use chunk content and batch information to rebuild stamps
  • Fall back to graceful degradation if reconstruction fails

Required Changes

  • Modify pkg/storer/migration/step_06.go migration logic
  • Add pre-migration validation checks
  • Improve error handling and rollback mechanisms
  • Update migration failure messaging with recovery options
  • Add telemetry to track migration success rates

Additional Context

This issue has caused recurring data loss during bee upgrades, particularly affecting nodes with customer data where database reset is not acceptable. The current "nuke database" recommendation is inadequate for production deployments.

Related Issues

  • Pin eviction protection issue (chunks lost due to eviction) #5215
  • Migration should coordinate with pinning system to prioritize preserving pinned data

crtahlin • Sep 16 '25 03:09

@crtahlin do you have the exact error that the migration returned? What exactly do you mean by "incomplete stamp metadata"?

gacevicljubisa • Sep 23 '25 07:09

Regarding "incomplete stamp metadata" - this refers to chunks that exist in the reserve storage but lack complete postage stamp information required by the new schema.

Specifically, "incomplete stamp metadata" means:

  1. Missing chunkstamp.Item entries: Chunks have BatchRadiusItemV1 records but no corresponding chunkstamp.Item in the index store
  2. Corrupted stamp data: chunkstamp.Item exists but stamp.Hash() computation fails
  3. Orphaned chunks: Chunks stored before stamp metadata requirements were fully enforced, leaving them without complete stamp records

The migration fails because step 06 requires every chunk in BatchRadiusItemV1 to have valid stamp metadata to compute the new StampHash field. When chunkstamp.LoadWithBatchID(idxStore, "reserve", batchRadiusItemV1.Address, batchRadiusItemV1.BatchID) cannot find or process the stamp data, the entire migration aborts.
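A diagnostic scan along these lines could tell the cases apart before migrating. This is only a sketch under assumptions: stamps are modelled as raw bytes in a map, and `stampHash` is a placeholder that rejects empty records, not the real hash computation. Orphaned chunks (case 3) would show up as "missing" here, since at scan time they look the same as case 1.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// status is a hypothetical classification of a chunk's stamp metadata.
type status string

const (
	statusOK      status = "ok"
	statusMissing status = "missing" // no stamp record at all
	statusCorrupt status = "corrupt" // record exists but hashing fails
)

// stampHash is a placeholder for the real stamp hash computation; it only
// rejects empty records so that the "corrupt" branch can be exercised.
func stampHash(raw []byte) (string, error) {
	if len(raw) == 0 {
		return "", errors.New("empty stamp record")
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// classify reports whether a chunk's stamp is usable, missing or corrupt.
func classify(addr string, stamps map[string][]byte) status {
	raw, ok := stamps[addr]
	if !ok {
		return statusMissing
	}
	if _, err := stampHash(raw); err != nil {
		return statusCorrupt
	}
	return statusOK
}

// scan groups chunk addresses by their stamp status without modifying anything.
func scan(addrs []string, stamps map[string][]byte) map[status][]string {
	out := make(map[status][]string)
	for _, a := range addrs {
		st := classify(a, stamps)
		out[st] = append(out[st], a)
	}
	return out
}

func main() {
	stamps := map[string][]byte{"addr1": []byte("stamp-bytes"), "addr2": {}}
	res := scan([]string{"addr1", "addr2", "addr3"}, stamps)
	fmt.Println("ok:", res[statusOK], "missing:", res[statusMissing], "corrupt:", res[statusCorrupt])
}
```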

This situation likely occurs on nodes that:

  • Operated before strict stamp validation was implemented
  • Experienced previous database inconsistencies
  • Have chunks from batches where stamp metadata was lost or corrupted

The issue affects any node upgrading to 2.6.0+ that has chunks without complete stamp-to-chunk associations in the database.


Specific code versions that would result in missing stamp metadata:

Timeline of changes:

  • Before July 16, 2024 (commit 289f4c88): Chunks could be stored without StampHash requirements
  • July 16, 2024: "feat: store stamp hash" introduced StampHash storage requirement
  • July 25, 2024 (commit dac77a5d): Migration step 06 added to handle the schema change

Specific scenarios causing incomplete stamp metadata:

  1. Pre-July 2024 nodes: Any bee node that stored chunks before commit 289f4c88 (July 16, 2024) would have:
       • BatchRadiusItemV1 entries for chunks (old schema)
       • Missing or incomplete chunkstamp.Item entries with proper stamp hash computation
  2. Interrupted storage operations: Nodes that experienced crashes or shutdowns during chunk storage operations between different bee versions
  3. Failed chunk uploads: The commit history shows fixes for "failed uploads" (5fe82a6b) and "overwrite existing stampindex" (aa2746cb), indicating there were periods where stamp metadata could be inconsistent

The root issue: Migration step 06 expects every chunk that has a BatchRadiusItemV1 entry to have complete stamp metadata that can compute a StampHash. But chunks stored before July 2024 were not required to have this metadata, creating the migration failure scenario.

This affects any production bee node that:

  • Operated before July 2024
  • Stored chunks under the old schema
  • Attempts to upgrade to 2.6.0+ (which includes the migration)

crtahlin • Sep 23 '25 11:09