
feat: live backups evm and ev-node

Open randygrok opened this issue 2 months ago • 3 comments

Overview

Closes: https://github.com/evstack/ev-node/issues/2641

randygrok avatar Oct 14 '25 15:10 randygrok

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ✅ passed · Updated (UTC): Nov 5, 2025, 8:45 PM

github-actions[bot] avatar Oct 14 '25 15:10 github-actions[bot]

Claude finished @randygrok's task.


Comprehensive Code Review Complete

I've completed a detailed analysis of the backup/restore implementation across all focus areas. Here are my findings:


1. Code Quality & Architecture

✅ Strengths

  • Clean separation of concerns: Backup/restore logic properly separated into pkg/store (core logic), pkg/cmd (CLI), and pkg/rpc (streaming)
  • Idiomatic Go: Good use of interfaces, proper error wrapping, standard library patterns
  • Modular design: New functionality integrates well with existing store interface without breaking changes
  • Streaming architecture: Smart use of streaming RPC for large backup transfers

⚠️ Issues & Recommendations

pkg/store/types.go:75-85 - Interface design concern

```go
type Backup interface {
    Backup(ctx context.Context, writer io.Writer, since uint64) (uint64, error)
    Restore(ctx context.Context, reader io.Reader) error
    Close() error // ❌ Duplicated from Store interface
}
```
  • Close() appears in both Backup and Store interfaces. This is redundant since Store embeds Backup
  • Reviewer julienrbrt already noted this shouldn't be in the interface

pkg/cmd/backup.go:20-45 - Flag management

```go
// Flags are defined inline in RunE
outputPath, err := cmd.Flags().GetString("output")
force, err := cmd.Flags().GetBool("force")
sinceVersion, err := cmd.Flags().GetUint64("since-version")
```
  • Recommendation: Define flags as variables at package level for better testability and clarity (as suggested by julienrbrt)
  • Pattern should be:
```go
const (
    flagOutput       = "output"
    flagForce        = "force"
    flagSinceVersion = "since-version"
)
```

pkg/cmd/backup.go:90-92 - Unnecessary nil check

```go
if ctx == nil {
    ctx = context.Background()
}
```
  • Cobra commands always have a context via cmd.Context()
  • This check is unnecessary (julienrbrt's comment)

pkg/store/restore.go:22 - Type assertion justification

```go
current, ok := any(s.db).(ds.Datastore)
```
  • The any() cast seems unnecessary since s.db is already ds.Batching which embeds ds.Datastore
  • Question: Why is this needed? (julienrbrt's comment)
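A minimal sketch supporting that point, using stand-in interfaces rather than the real go-datastore types: an interface value can be asserted (or assigned) directly to an interface it embeds, so the any() wrapper adds nothing:

```go
package main

import "fmt"

// Stand-ins mirroring the shape of ds.Datastore and ds.Batching
// (assumptions, not the real go-datastore definitions).
type Datastore interface{ Get(key string) ([]byte, error) }
type Batching interface {
    Datastore
    Batch() error
}

type memDS struct{}

func (memDS) Get(string) ([]byte, error) { return nil, nil }
func (memDS) Batch() error               { return nil }

func main() {
    var db Batching = memDS{}
    // Direct assertion works because Batching embeds Datastore;
    // plain assignment (var d Datastore = db) would also compile.
    current, ok := db.(Datastore)
    fmt.Println(ok, current != nil)
}
```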

2. Error Handling & Resilience

✅ Strengths

  • Proper error wrapping with context using fmt.Errorf with %w
  • Cleanup on error: partial backup files are removed on failure (backup.go:100-104)
  • Stream error handling properly distinguishes context errors from internal errors

⚠️ Issues & Recommendations

pkg/cmd/backup.go:78-111 - Error cleanup can be improved

```go
defer file.Close() // ❌ Error ignored

// Later in error paths:
_ = writer.Flush()
_ = file.Close()
_ = os.Remove(outputPath)
```
  • Recommendation: Use defer for cleanup instead of repeating in every error path (julienrbrt's suggestion):
```go
defer func() {
    _ = file.Close()
    if err != nil {
        _ = os.Remove(outputPath)
    }
}()
```

pkg/rpc/server/server.go:219-263 - Stream context cancellation

```go
func (s *StoreServer) Backup(...) error {
    writer := newBackupStreamWriter(stream, defaultBackupChunkSize)
    version, err := s.store.Backup(ctx, writer, since)
    // What if ctx is canceled during backup?
}
```
  • Context cancellation during backup isn't explicitly tested
  • Recommendation: Add explicit context check before starting expensive operations
  • Badger's internal backup may handle this, but it's not obvious from the code

pkg/store/backup.go:16-18 - Context validation

```go
if err := ctx.Err(); err != nil {
    return 0, err
}
```
  • ✅ Good: Context checked at entry
  • ❌ Missing: Long-running backups don't check context during execution
  • Recommendation: Pass context through to Badger operations if supported

pkg/cmd/restore.go:86 - Resource leak risk

```go
defer kvStore.Close()
// ...
defer file.Close()
```
  • If restore fails, store is closed even if partially written
  • Question: Should we validate the restore completed successfully before closing?
  • Recommendation: Consider keeping store open on error for debugging

3. Security

✅ Strengths

  • Backup files created with 0o600 permissions (backup.go:74) - good security practice
  • No credentials or sensitive data logged
  • RPC endpoints use standard Connect authentication

⚠️ Issues & Recommendations

pkg/cmd/backup.go:66-72 - File overwrite validation

```go
if !force {
    if _, statErr := os.Stat(outputPath); statErr == nil {
        return fmt.Errorf("output file %s already exists (use --force to overwrite)", outputPath)
    }
}
```
  • ✅ Good: Prevents accidental overwrites
  • ⚠️ Race condition: a TOCTOU (time-of-check to time-of-use) window exists between Stat and OpenFile
  • Recommendation: Use os.O_EXCL flag instead when not forcing:
```go
flags := os.O_CREATE | os.O_WRONLY | os.O_TRUNC
if !force {
    flags |= os.O_EXCL
}
file, err := os.OpenFile(outputPath, flags, 0o600)
```

pkg/cmd/restore.go:54-65 - Directory deletion security

```go
if force {
    if err := os.RemoveAll(dbPath); err != nil {
        return fmt.Errorf("failed to remove existing datastore: %w", err)
    }
}
```
  • ✅ Requires explicit --force flag
  • ⚠️ Risk: No validation that dbPath is within expected boundaries
  • Recommendation: Add path validation to prevent accidentally deleting system directories:
```go
if force {
    // Validate dbPath is within rootDir
    if !strings.HasPrefix(dbPath, nodeConfig.RootDir) {
        return fmt.Errorf("datastore path must be within root directory")
    }
    // ... proceed with removal
}
```
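The strings.HasPrefix check above has a known weakness: "/data/db-old" passes a prefix check against root "/data/db". A sketch of a more robust containment test using filepath.Rel (the within helper name is illustrative):

```go
package main

import (
    "fmt"
    "path/filepath"
    "strings"
)

// within reports whether path is inside root. filepath.Rel avoids
// the strings.HasPrefix pitfall where a sibling directory sharing
// the root's name as a prefix would incorrectly pass.
func within(root, path string) bool {
    rel, err := filepath.Rel(filepath.Clean(root), filepath.Clean(path))
    if err != nil {
        return false
    }
    return rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

func main() {
    fmt.Println(within("/data/db", "/data/db/badger"))   // subpath: allowed
    fmt.Println(within("/data/db", "/data/db-old"))      // sibling prefix: rejected
    fmt.Println(within("/data/db", "/data/db/../other")) // traversal: rejected
}
```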

RPC backup endpoint - No authentication mentioned

  • Question: Are backup RPCs authenticated/authorized?
  • Recommendation: Document authentication requirements
  • Backups may contain sensitive application state

4. Performance & Resource Efficiency

✅ Strengths

  • Buffered I/O: 1 MiB buffer for both read and write operations
  • Chunked streaming: 128 KB chunks reduce memory pressure
  • Zero-copy where possible: io.MultiWriter composition

⚠️ Issues & Recommendations

pkg/rpc/server/server.go:267-339 - Memory allocation pattern

```go
func (w *backupStreamWriter) flush() error {
    chunk := make([]byte, len(w.buf)) // ❌ Allocates on every flush
    copy(chunk, w.buf)
    // ...
}
```
  • Issue: Creates new byte slice on every chunk flush
  • The copy is necessary because gRPC may retain the slice
  • Recommendation: Document why the copy is needed (gRPC retention)
  • Optimization: Consider using sync.Pool for chunk buffers if profiling shows allocation pressure
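If profiling does show allocation pressure, a sync.Pool version could look like the sketch below; note the caveat in the comments: returning a buffer to the pool is only safe once the send has fully completed, which depends on the streaming layer's retention semantics (flushChunk and its send callback are hypothetical, not the PR's API):

```go
package main

import (
    "fmt"
    "sync"
)

// chunkPool reuses chunk buffers between flushes to avoid a fresh
// allocation per chunk.
var chunkPool = sync.Pool{
    New: func() any { return make([]byte, 0, 128<<10) },
}

// flushChunk copies buf into a pooled buffer and hands it to send.
// The copy is still required because the caller may reuse buf; the
// Put is safe only if send is synchronous and does not retain the
// slice afterwards (an assumption to verify against gRPC behavior).
func flushChunk(buf []byte, send func([]byte) error) error {
    chunk := chunkPool.Get().([]byte)[:0]
    chunk = append(chunk, buf...)
    err := send(chunk)
    chunkPool.Put(chunk[:0])
    return err
}

func main() {
    var got int
    err := flushChunk([]byte("abc"), func(b []byte) error { got = len(b); return nil })
    fmt.Println(err == nil, got)
}
```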

pkg/cmd/backup.go:80-82 - Buffering strategy

```go
writer := bufio.NewWriterSize(file, 1<<20) // 1 MiB buffer
bytesCount := &countingWriter{}
streamWriter := io.MultiWriter(writer, bytesCount)
```
  • ✅ Good: Large buffer reduces syscalls
  • Consideration: 1 MiB buffer + 128 KB chunks + badger internal buffers
  • Total memory usage not obvious from code
  • Recommendation: Document expected memory usage for large backups
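The countingWriter referenced above is not shown in the diff; a plausible minimal implementation of such a byte-counting sink, composed via io.MultiWriter, would be:

```go
package main

import (
    "bytes"
    "fmt"
    "io"
)

// countingWriter counts bytes without buffering them, so it adds
// no memory overhead to the MultiWriter chain.
type countingWriter struct{ n int64 }

func (c *countingWriter) Write(p []byte) (int, error) {
    c.n += int64(len(p))
    return len(p), nil
}

func main() {
    var out bytes.Buffer
    count := &countingWriter{}
    w := io.MultiWriter(&out, count) // every write hits both sinks
    _, _ = w.Write([]byte("hello"))
    fmt.Println(out.String(), count.n)
}
```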

pkg/store/backup.go:21-35 - Datastore unwrapping

```go
if badgerDatastore, ok := s.db.(*badger4.Datastore); ok {
    return backupBadger(badgerDatastore, writer, since)
}

if shim, ok := s.db.(ds.Shim); ok {
    children := shim.Children()
    if len(children) > 0 {
        if badgerDatastore, ok := children[0].(*badger4.Datastore); ok {
            return backupBadger(badgerDatastore, writer, since)
        }
    }
}
```
  • ✅ Handles both direct and wrapped datastores
  • ⚠️ Only unwraps one level - deeply nested wrappers not supported
  • Recommendation: Document limitation or add recursive unwrapping
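Recursive unwrapping could be sketched as a simple loop; the Shim and datastore types below are stand-ins for ds.Shim and *badger4.Datastore, not the real ones:

```go
package main

import "fmt"

// Shim mirrors the shape of ds.Shim (assumption: one child matters).
type Shim interface{ Children() []any }

// badgerDS plays the role of *badger4.Datastore in this sketch.
type badgerDS struct{ name string }

type wrapper struct{ child any }

func (w wrapper) Children() []any { return []any{w.child} }

// unwrapBadger walks arbitrarily deep Shim wrappers until it finds
// the underlying badger datastore, or returns nil if none exists.
func unwrapBadger(db any) *badgerDS {
    for {
        if b, ok := db.(*badgerDS); ok {
            return b
        }
        shim, ok := db.(Shim)
        if !ok {
            return nil
        }
        children := shim.Children()
        if len(children) == 0 {
            return nil
        }
        db = children[0]
    }
}

func main() {
    nested := wrapper{child: wrapper{child: &badgerDS{name: "inner"}}}
    b := unwrapBadger(nested)
    fmt.Println(b != nil && b.name == "inner")
}
```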

Restore goroutine safety - pkg/store/restore.go

  • Restore modifies datastore while it may be in use
  • Question: Is the node guaranteed to be stopped during restore?
  • Recommendation: Document that restore should only be used on stopped nodes

5. Testing & Reliability

✅ Strengths

  • Unit tests for backup command with mocked RPC server
  • Tests cover happy path and error cases
  • Integration with real Badger datastore tested (store_test.go:1197-1216)

❌ Critical Issues

Test coverage is only 17.66% (317 lines missing coverage):

  • pkg/rpc/server/server.go: 110 lines missing (Backup RPC handler)
  • pkg/cmd/restore.go: 84 lines missing (entire restore command)
  • pkg/store/restore.go: 37 lines missing (restore logic)

Missing test scenarios:

  1. Incremental backups (since-version parameter):

    • No tests verify incremental backup functionality
    • Critical for production usage
  2. Restore validation:

    • No tests for restore command
    • No verification that restored data is valid
    • No tests for restore failure scenarios
  3. Concurrent access:

    • What happens if backup is triggered during block production?
    • What happens if multiple backups run simultaneously?
  4. Large dataset behavior:

    • No tests with large databases
    • Memory usage not validated
    • Stream timeout scenarios not tested
  5. Error recovery:

    • Network interruption during streaming backup
    • Partial file corruption
    • Disk full scenarios

Recommendations:

```go
// Add these test cases:

func TestBackupCmd_IncrementalBackup(t *testing.T) {
    // Test since-version parameter
    // Verify only new data is backed up
}

func TestRestoreCmd_Success(t *testing.T) {
    // Test full restore flow
    // Verify data integrity after restore
}

func TestRestoreCmd_InvalidBackup(t *testing.T) {
    // Test restore with corrupted backup file
}

func TestBackup_StreamInterruption(t *testing.T) {
    // Simulate network failure mid-stream
}

func TestBackup_ConcurrentAccess(t *testing.T) {
    // Backup while store is being written to
}
```

6. Documentation & Developer Experience

⚠️ Issues & Recommendations

Missing documentation:

  1. No README or guide for backup/restore operations

    • How to perform backups?
    • How to test incremental backups?
    • What's the disaster recovery procedure?
  2. API documentation gaps:

    • Backup interface: When is Close() called?
    • Incremental backup: How does since parameter work?
    • Restore: What happens to existing data?
  3. Operational concerns not documented:

    • Can backup run while node is running?
    • How long do backups take?
    • What's the performance impact?
    • How to validate backup integrity?
  4. Example usage missing:

```shell
# Should document these workflows:

# Full backup
./ev-node backup --output=snapshot.badger

# Incremental backup
./ev-node backup --output=incremental.badger --since-version=12345

# Restore
./ev-node restore --input=snapshot.badger --force

# Remote backup
./ev-node backup --evnode.rpc.address=node.example.com:9090
```

Recommendations:

  • Add docs/backup-restore.md with operational guide
  • Add godoc comments explaining parameters
  • Add examples to command help text
  • Document performance characteristics

Summary of Critical Items

Must Fix Before Merge:

  1. Remove duplicate Close() from Backup interface (pkg/store/types.go:84)
  2. Add restore tests - 84 untested lines is unacceptable for data recovery code
  3. ⚠️ Fix TOCTOU race condition in file overwrite check (pkg/cmd/backup.go:66-72)
  4. ⚠️ Add path validation to prevent accidental directory deletion (pkg/cmd/restore.go:61-64)

Should Fix:

  1. Use defer for cleanup instead of repeating in error paths (pkg/cmd/backup.go)
  2. Define flags as constants for better code organization
  3. Remove unnecessary nil check (pkg/cmd/backup.go:90-92)
  4. Clarify any() type assertion (pkg/store/restore.go:22)

Nice to Have:

  1. Add incremental backup tests
  2. Document operational procedures
  3. Add concurrent access tests
  4. Consider sync.Pool optimization for chunks

Overall Assessment

This is a solid foundation for backup/restore functionality with good architecture and clean implementation. However:

  • Test coverage (17.66%) is insufficient for disaster recovery features
  • Security weaknesses (TOCTOU race, unvalidated deletion path) need addressing
  • Operational documentation is completely missing

Recommendation: Request changes to address critical security issues and add restore tests before merging.

claude[bot] avatar Oct 14 '25 15:10 claude[bot]

Codecov Report

❌ Patch coverage is 17.66% with 317 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.55%. Comparing base (de40783) to head (7a31192).

Files with missing lines      Patch %   Lines
pkg/rpc/server/server.go        0.00%   110 missing
pkg/cmd/restore.go              0.00%   84 missing
pkg/cmd/backup.go              48.48%   37 missing, 14 partials
pkg/store/restore.go            0.00%   37 missing
pkg/rpc/client/client.go       35.29%   15 missing, 7 partials
pkg/store/backup.go            38.09%   11 missing, 2 partials
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2758      +/-   ##
==========================================
- Coverage   61.42%   59.55%   -1.88%     
==========================================
  Files          81       85       +4     
  Lines        8622     9007     +385     
==========================================
+ Hits         5296     5364      +68     
- Misses       2828     3122     +294     
- Partials      498      521      +23     
Flag       Coverage Δ
combined   59.55% <17.66%> (-1.88% ↓)

Flags with carried forward coverage won't be shown.

View full report in Codecov by Sentry.

codecov[bot] avatar Oct 14 '25 15:10 codecov[bot]