feat: live backups evm and ev-node
Overview
Closes: https://github.com/evstack/ev-node/issues/2641
The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).
| Build | Format | Lint | Breaking | Updated (UTC) |
|---|---|---|---|---|
| ✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Nov 5, 2025, 8:45 PM |
Claude finished @randygrok's task
Comprehensive Code Review Complete
I've completed a detailed analysis of the backup/restore implementation across all focus areas. Here are my findings:
1. Code Quality & Architecture
✅ Strengths
- Clean separation of concerns: Backup/restore logic properly separated into `pkg/store` (core logic), `pkg/cmd` (CLI), and `pkg/rpc` (streaming)
- Idiomatic Go: Good use of interfaces, proper error wrapping, standard library patterns
- Modular design: New functionality integrates well with existing store interface without breaking changes
- Streaming architecture: Smart use of streaming RPC for large backup transfers
⚠️ Issues & Recommendations
pkg/store/types.go:75-85 - Interface design concern
```go
type Backup interface {
    Backup(ctx context.Context, writer io.Writer, since uint64) (uint64, error)
    Restore(ctx context.Context, reader io.Reader) error
    Close() error // ❌ Duplicated from Store interface
}
```
- `Close()` appears in both `Backup` and `Store` interfaces; this is redundant since `Store` embeds `Backup`
- Reviewer julienrbrt already noted this shouldn't be in the interface
pkg/cmd/backup.go:20-45 - Flag management
```go
// Flags are defined inline in RunE
outputPath, err := cmd.Flags().GetString("output")
force, err := cmd.Flags().GetBool("force")
sinceVersion, err := cmd.Flags().GetUint64("since-version")
```
- Recommendation: Define flag names as package-level constants for better testability and clarity (as suggested by julienrbrt)
- Pattern should be:
```go
const (
    flagOutput       = "output"
    flagForce        = "force"
    flagSinceVersion = "since-version"
)
```
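With the names centralized, registration and lookup reference the same constant, e.g. `cmd.Flags().String(flagOutput, "", "path to write the backup to")` paired with `cmd.Flags().GetString(flagOutput)`, so a rename touches a single line and tests can reference the constants directly.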
pkg/cmd/backup.go:90-92 - Unnecessary nil check
```go
if ctx == nil {
    ctx = context.Background()
}
```
- Cobra commands always have a context via `cmd.Context()`
- This check is unnecessary (julienrbrt's comment)
pkg/store/restore.go:22 - Type assertion justification
```go
current, ok := any(s.db).(ds.Datastore)
```
- The `any()` cast seems unnecessary since `s.db` is already `ds.Batching`, which embeds `ds.Datastore`
- Question: Why is this needed? (julienrbrt's comment)
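If `s.db` is indeed declared as `ds.Batching`, which embeds `ds.Datastore`, the assertion can be dropped entirely; `var current ds.Datastore = s.db` compiles as a plain assignment.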
2. Error Handling & Resilience
✅ Strengths
- Proper error wrapping with context using `fmt.Errorf` with `%w`
- Cleanup on error: partial backup files are removed on failure (backup.go:100-104)
- Stream error handling properly distinguishes context errors from internal errors
⚠️ Issues & Recommendations
pkg/cmd/backup.go:78-111 - Error cleanup can be improved
```go
defer file.Close() // ❌ Error ignored

// Later in error paths:
_ = writer.Flush()
_ = file.Close()
_ = os.Remove(outputPath)
```
- Recommendation: Use `defer` for cleanup instead of repeating in every error path (julienrbrt's suggestion):
```go
defer func() {
    _ = file.Close()
    if err != nil {
        _ = os.Remove(outputPath)
    }
}()
```
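Note that this pattern requires `err` to be a named return value of the enclosing function (`func ... (err error)`) so the deferred closure observes the error that is ultimately returned.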
pkg/rpc/server/server.go:219-263 - Stream context cancellation
```go
func (s *StoreServer) Backup(...) error {
    writer := newBackupStreamWriter(stream, defaultBackupChunkSize)
    version, err := s.store.Backup(ctx, writer, since)
    // What if ctx is canceled during backup?
}
```
- Context cancellation during backup isn't explicitly tested
- Recommendation: Add explicit context check before starting expensive operations
- Badger's internal backup may handle this, but it's not obvious from the code
pkg/store/backup.go:16-18 - Context validation
```go
if err := ctx.Err(); err != nil {
    return 0, err
}
```
- ✅ Good: Context checked at entry
- ❌ Missing: Long-running backups don't check context during execution
- Recommendation: Pass context through to Badger operations if supported, or wrap the writer as sketched below
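A minimal sketch of one way to do this even if Badger's backup entry point only accepts an `io.Writer`: wrap the destination writer so every write observes the context. Names mirror the snippets above; treat this as an illustration, not the PR's implementation.

```go
import (
    "context"
    "io"
)

// ctxWriter fails fast once ctx is canceled, so a long-running
// backup stops at the next write instead of running to completion.
type ctxWriter struct {
    ctx context.Context
    w   io.Writer
}

func (cw *ctxWriter) Write(p []byte) (int, error) {
    if err := cw.ctx.Err(); err != nil {
        return 0, err // surfaces context.Canceled or DeadlineExceeded
    }
    return cw.w.Write(p)
}
```

Backups write frequently, so this yields cancellation checks at write granularity without requiring context support from Badger itself.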
pkg/cmd/restore.go:86 - Resource leak risk
defer kvStore.Close()
// ...
defer file.Close()
- If restore fails, store is closed even if partially written
- Question: Should we validate the restore completed successfully before closing?
- Recommendation: Consider keeping store open on error for debugging
3. Security
✅ Strengths
- Backup files created with `0o600` permissions (backup.go:74), a good security practice
- No credentials or sensitive data logged
- RPC endpoints use standard Connect authentication
⚠️ Issues & Recommendations
pkg/cmd/backup.go:66-72 - File overwrite validation
```go
if !force {
    if _, statErr := os.Stat(outputPath); statErr == nil {
        return fmt.Errorf("output file %s already exists (use --force to overwrite)", outputPath)
    }
}
```
- ✅ Good: Prevents accidental overwrites
- ⚠️ Race condition: TOCTOU (time-of-check to time-of-use) between `Stat` and `OpenFile`
- Recommendation: Use the `os.O_EXCL` flag instead when not forcing:
```go
flags := os.O_CREATE | os.O_WRONLY | os.O_TRUNC
if !force {
    flags |= os.O_EXCL
}
file, err := os.OpenFile(outputPath, flags, 0o600)
```
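With `O_CREATE|O_EXCL`, the existence check and the file creation become a single atomic operation at the filesystem level, so the window between `Stat` and `OpenFile` disappears entirely.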
pkg/cmd/restore.go:54-65 - Directory deletion security
```go
if force {
    if err := os.RemoveAll(dbPath); err != nil {
        return fmt.Errorf("failed to remove existing datastore: %w", err)
    }
}
```
- ✅ Requires explicit `--force` flag
- ⚠️ Risk: No validation that `dbPath` is within expected boundaries
- Recommendation: Add path validation to prevent accidentally deleting system directories:
```go
if force {
    // Validate dbPath is within rootDir; filepath.Rel avoids the
    // prefix-match pitfall where "/root" would also match "/rootdir".
    rel, relErr := filepath.Rel(nodeConfig.RootDir, dbPath)
    if relErr != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(os.PathSeparator)) {
        return fmt.Errorf("datastore path must be within root directory")
    }
    // ... proceed with removal
}
```
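Note the containment check uses `filepath.Rel` rather than a bare string-prefix comparison, since `/data/root` is a string prefix of `/data/rootdir`; resolving symlinks first (e.g. with `filepath.EvalSymlinks`) would also close the case where a link inside the root points outside it.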
RPC backup endpoint - No authentication mentioned
- Question: Are backup RPCs authenticated/authorized?
- Recommendation: Document authentication requirements
- Backups may contain sensitive application state
4. Performance & Resource Efficiency
✅ Strengths
- Buffered I/O: 1 MiB buffer for both read and write operations
- Chunked streaming: 128 KB chunks reduce memory pressure
- Zero-copy where possible: `io.MultiWriter` composition
⚠️ Issues & Recommendations
pkg/rpc/server/server.go:267-339 - Memory allocation pattern
```go
func (w *backupStreamWriter) flush() error {
    chunk := make([]byte, len(w.buf)) // ❌ Allocates on every flush
    copy(chunk, w.buf)
    // ...
}
```
- Issue: Creates new byte slice on every chunk flush
- The copy is necessary because gRPC may retain the slice
- Recommendation: Document why the copy is needed (gRPC retention)
- Optimization: Consider using `sync.Pool` for chunk buffers if profiling shows allocation pressure (sketched below)
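A minimal sketch of the pooling idea, reusing the `backupStreamWriter` shape shown above; `sendChunk` is a hypothetical stand-in for handing bytes to the stream, and recycling the buffer right after sending is only safe if the RPC framework has fully serialized it by then, so verify that guarantee before adopting this:

```go
import "sync"

var chunkPool = sync.Pool{
    // New returns an empty slice backed by a full-size chunk buffer.
    New: func() any { return make([]byte, 0, defaultBackupChunkSize) },
}

func (w *backupStreamWriter) flush() error {
    if len(w.buf) == 0 {
        return nil
    }
    chunk := chunkPool.Get().([]byte)
    if cap(chunk) < len(w.buf) {
        chunk = make([]byte, 0, len(w.buf)) // rare: buffer grew past chunk size
    }
    chunk = chunk[:len(w.buf)]
    copy(chunk, w.buf)
    err := w.sendChunk(chunk) // hypothetical: wraps stream.Send
    chunkPool.Put(chunk[:0])  // recycle only once the transport is done with it
    w.buf = w.buf[:0]
    return err
}
```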
pkg/cmd/backup.go:80-82 - Buffering strategy
```go
writer := bufio.NewWriterSize(file, 1<<20) // 1 MiB buffer
bytesCount := &countingWriter{}
streamWriter := io.MultiWriter(writer, bytesCount)
```
- ✅ Good: Large buffer reduces syscalls
- Consideration: 1 MiB buffer + 128 KB chunks + badger internal buffers
- Total memory usage not obvious from code
- Recommendation: Document expected memory usage for large backups
pkg/store/backup.go:21-35 - Datastore unwrapping
```go
if badgerDatastore, ok := s.db.(*badger4.Datastore); ok {
    return backupBadger(badgerDatastore, writer, since)
}
if shim, ok := s.db.(ds.Shim); ok {
    children := shim.Children()
    if len(children) > 0 {
        if badgerDatastore, ok := children[0].(*badger4.Datastore); ok {
            return backupBadger(badgerDatastore, writer, since)
        }
    }
}
```
- ✅ Handles both direct and wrapped datastores
- ⚠️ Only unwraps one level - deeply nested wrappers not supported
- Recommendation: Document the limitation or add recursive unwrapping (see the sketch below)
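A minimal sketch of recursive unwrapping, reusing the `ds.Shim` and `badger4.Datastore` types from the snippet above:

```go
// findBadger walks nested shims depth-first and returns the first
// Badger-backed datastore it finds, or false if none exists.
func findBadger(d ds.Datastore) (*badger4.Datastore, bool) {
    if b, ok := d.(*badger4.Datastore); ok {
        return b, true
    }
    if shim, ok := d.(ds.Shim); ok {
        for _, child := range shim.Children() {
            if b, ok := findBadger(child); ok {
                return b, true
            }
        }
    }
    return nil, false
}
```

`Backup` would then reduce to `if b, ok := findBadger(s.db); ok { return backupBadger(b, writer, since) }`.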
Restore goroutine safety - pkg/store/restore.go
- Restore modifies datastore while it may be in use
- Question: Is the node guaranteed to be stopped during restore?
- Recommendation: Document that restore should only be used on stopped nodes
5. Testing & Reliability
✅ Strengths
- Unit tests for backup command with mocked RPC server
- Tests cover happy path and error cases
- Integration with real Badger datastore tested (store_test.go:1197-1216)
❌ Critical Issues
Test coverage is only 17.66% (317 lines missing coverage):
- `pkg/rpc/server/server.go`: 110 lines missing (Backup RPC handler)
- `pkg/cmd/restore.go`: 84 lines missing (entire restore command)
- `pkg/store/restore.go`: 37 lines missing (restore logic)
Missing test scenarios:
- Incremental backups (`since-version` parameter):
  - No tests verify incremental backup functionality
  - Critical for production usage
- Restore validation:
  - No tests for restore command
  - No verification that restored data is valid
  - No tests for restore failure scenarios
- Concurrent access:
  - What happens if backup is triggered during block production?
  - What happens if multiple backups run simultaneously?
- Large dataset behavior:
  - No tests with large databases
  - Memory usage not validated
  - Stream timeout scenarios not tested
- Error recovery:
  - Network interruption during streaming backup
  - Partial file corruption
  - Disk full scenarios
Recommendations:
```go
// Add these test cases:
func TestBackupCmd_IncrementalBackup(t *testing.T) {
    // Test since-version parameter
    // Verify only new data is backed up
}

func TestRestoreCmd_Success(t *testing.T) {
    // Test full restore flow
    // Verify data integrity after restore
}

func TestRestoreCmd_InvalidBackup(t *testing.T) {
    // Test restore with corrupted backup file
}

func TestBackup_StreamInterruption(t *testing.T) {
    // Simulate network failure mid-stream
}

func TestBackup_ConcurrentAccess(t *testing.T) {
    // Backup while store is being written to
}
```
6. Documentation & Developer Experience
⚠️ Issues & Recommendations
Missing documentation:
- No README or guide for backup/restore operations
  - How to perform backups?
  - How to test incremental backups?
  - What's the disaster recovery procedure?
- API documentation gaps:
  - `Backup` interface: When is `Close()` called?
  - Incremental backup: How does the `since` parameter work?
  - Restore: What happens to existing data?
- Operational concerns not documented:
  - Can backup run while node is running?
  - How long do backups take?
  - What's the performance impact?
  - How to validate backup integrity?
- Example usage missing:
```sh
# Should document these workflows:

# Full backup
./ev-node backup --output=snapshot.badger

# Incremental backup
./ev-node backup --output=incremental.badger --since-version=12345

# Restore
./ev-node restore --input=snapshot.badger --force

# Remote backup
./ev-node backup --evnode.rpc.address=node.example.com:9090
```
Recommendations:
- Add `docs/backup-restore.md` with an operational guide
- Add godoc comments explaining parameters
- Add examples to command help text
- Document performance characteristics
Summary of Critical Items
Must Fix Before Merge:
- ❌ Remove duplicate `Close()` from Backup interface (pkg/store/types.go:84)
- ❌ Add restore tests: 84 untested lines is unacceptable for data recovery code
- ⚠️ Fix TOCTOU race condition in file overwrite check (pkg/cmd/backup.go:66-72)
- ⚠️ Add path validation to prevent accidental directory deletion (pkg/cmd/restore.go:61-64)
Should Fix:
- Use defer for cleanup instead of repeating in error paths (pkg/cmd/backup.go)
- Define flags as constants for better code organization
- Remove unnecessary nil check (pkg/cmd/backup.go:90-92)
- Clarify the `any()` type assertion (pkg/store/restore.go:22)
Nice to Have:
- Add incremental backup tests
- Document operational procedures
- Add concurrent access tests
- Consider `sync.Pool` optimization for chunks
Overall Assessment
This is a solid foundation for backup/restore functionality with good architecture and clean implementation. However:
- Test coverage (17.66%) is insufficient for disaster recovery features
- Security vulnerabilities (TOCTOU, path traversal) need addressing
- Operational documentation is completely missing
Recommendation: Request changes to address critical security issues and add restore tests before merging.
Codecov Report
❌ Patch coverage is 17.66234% with 317 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.55%. Comparing base (de40783) to head (7a31192).
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #2758      +/-   ##
==========================================
- Coverage   61.42%   59.55%    -1.88%
==========================================
  Files          81       85        +4
  Lines        8622     9007      +385
==========================================
+ Hits         5296     5364       +68
- Misses       2828     3122      +294
- Partials      498      521       +23
```
| Flag | Coverage Δ | |
|---|---|---|
| combined | 59.55% <17.66%> (-1.88%) | ⬇️ |
Flags with carried forward coverage won't be shown.
☂️ View full report in Codecov by Sentry.