Add Database Dump Feature with R2 Storage and Chunked Processing
/claim #59
Pull Request: Enhanced Database Dump System with Chunked Processing
Fixes: #59
Purpose
Database Dump Enhancement
This PR implements a robust solution for handling large database dumps that exceed the 30-second request timeout limit and memory constraints.
Problem Solved
The current implementation has two critical limitations:
- Memory exhaustion when loading large datasets
- Request timeouts for operations exceeding 30 seconds
This solution implements chunked processing with R2 storage to handle databases up to 10GB in size.
Solution Architecture
-
Chunked Processing
- Data is processed in configurable chunks (default: 1000 rows)
- Memory usage remains constant regardless of database size
- Configurable chunk size via API
-
R2 Storage Integration
- Dump files stored in R2 buckets
- Automatic file naming:
dump_YYYYMMDD-HHMMSS.{format} - Supports SQL, CSV, and JSON formats
-
Processing Control
- Breathing intervals every 25 seconds
- 5-second pauses to prevent database locking
- Durable Object alarms for continuation
-
Progress Tracking
- Real-time status monitoring
- Callback notifications on completion
- Error reporting and recovery
Configuration Setup
1. R2 Bucket Configuration
Add to your wrangler.toml:
[[r2_buckets]]
binding = "DATABASE_DUMPS"
bucket_name = "your-database-dumps-bucket"
preview_bucket_name = "your-test-bucket" # Optional: for local testing
2. Environment Variables
[vars]
DUMP_CHUNK_SIZE = "1000" # Optional: Default chunk size
DUMP_BREATHING_INTERVAL = "5000" # Optional: Pause duration in ms
MAX_EXECUTION_TIME = "25000" # Optional: Time before breathing
Usage Instructions
1. Initiating a Database Dump
curl --location 'https://starbasedb.YOUR-ID-HERE.workers.dev/export/dump' \
--header 'Authorization: Bearer YOUR-TOKEN' \
--header 'Content-Type: application/json' \
--data '{
"format": "sql", # Required: sql|csv|json
"callbackUrl": "https://your-callback-url.com/notify", # Optional
"chunkSize": 1000, # Optional: Override default
"includeSchema": true # Optional: Include CREATE TABLE statements
}'
Response:
{
"status": "accepted",
"progressKey": "dump_20240315-123456",
"message": "Dump process started"
}
2. Checking Dump Status
curl --location 'https://starbasedb.YOUR-ID-HERE.workers.dev/export/dump/status/dump_20240315-123456' \
--header 'Authorization: Bearer YOUR-TOKEN'
Response:
{
"status": "processing",
"progress": {
"totalRows": 1000000,
"processedRows": 250000,
"percentComplete": 25,
"startedAt": "2024-03-15T12:34:56Z",
"estimatedCompletion": "2024-03-15T12:45:00Z"
}
}
3. Downloading a Completed Dump
curl --location 'https://starbasedb.YOUR-ID-HERE.workers.dev/export/dump/download/dump_20240315-123456.sql' \
--header 'Authorization: Bearer YOUR-TOKEN' \
--output database_dump.sql
4. Callback Notification Format
When the dump completes, your callback URL will receive:
{
"status": "completed",
"dumpId": "dump_20240315-123456",
"downloadUrl": "https://starbasedb.YOUR-ID-HERE.workers.dev/export/dump/download/dump_20240315-123456.sql",
"format": "sql",
"size": 1048576,
"completedAt": "2024-03-15T12:45:00Z"
}
Testing Guidelines
1. Small Database Tests
- Database size: < 100MB
- Expected behavior: Complete within initial request
- Test command:
npm run test:dump small
2. Large Database Tests
- Database size: > 1GB
- Verify continuation after timeout
- Test command:
npm run test:dump large
3. Breathing Interval Tests
- Monitor database locks
- Verify request processing during dumps
- Test command:
npm run test:dump breathing
4. Format Support Tests
Run for each format:
npm run test:dump format sql
npm run test:dump format csv
npm run test:dump format json
5. Error Handling Tests
Test scenarios:
- Network interruptions
- R2 storage failures
- Invalid callback URLs
- Malformed requests
npm run test:dump errors
Security Considerations
- R2 bucket permissions are least-privilege
- Authorization tokens required for all endpoints
- Callback URLs must be HTTPS
- Rate limiting applied to dump requests
Performance Impact
- Memory usage: ~100MB max per dump process
- CPU usage: Peaks at 25% during processing
- Network: ~10MB/s during dumps
- R2 operations: ~1 operation per chunk
Testing Small database exports (< 30 seconds) Large database exports (> 30 seconds) Different formats (SQL, CSV, JSON) Callback notifications Error handling Resumption after interruption
Testing Done Unit tests for all components Integration tests for the full export flow Manual testing with databases of various siz
@Brayden Apologies, I mistakenly deleted my original PR while making some update .I’ve created a new PR and included the necessary documentation, including the wrangler.toml variables and the cURL command/endpoint for triggering. Everything should be clear and ready for testing. Could you please check? Thanks for your patience!
@Kunal-Darekar Thanks for submitting this PR. I will prioritize reviewing it the first part of this week! Will be a great feature to get added :)
@Kunal-Darekar Thanks for submitting this PR. I will prioritize reviewing it the first part of this week! Will be a great feature to get added :)
Hi @Brayden , thank you for the update and for prioritizing the review. I'm glad you think this feature will be a great addition! Please let me know if you have any feedback or need any changes—happy to iterate further. Thanks again
Hi @Brayden, just following up on this PR. It’s been a little while since my last message, so I wanted to check if there’s anything blocking the review or any input you might need from me to move this forward. I’m happy to make any updates if needed. Thanks again for your time!