Feature/chunked database dumps
Pull Request: Implement Chunked Database Dumps with Enhanced Features
/claim #59
Fixes: #59
## Purpose
Implement robust chunked database dump functionality that handles large databases efficiently while preventing memory exhaustion and database locking. This implementation includes several key features:
**1. Chunked Processing**
- Processes database dumps in manageable chunks
- Dynamic chunk sizing based on table size (500-5000 rows)
- Breathing intervals to prevent database locking
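As a rough sketch of the sizing idea (the 500-5000 bounds come from this PR, but the heuristic and names below are illustrative, not the actual implementation):

```typescript
// Illustrative sketch only: the 500-5000 row bounds are from this PR;
// the "~100 chunks per table" heuristic is an assumption for illustration.
const MIN_CHUNK_ROWS = 500;
const MAX_CHUNK_ROWS = 5000;

// Pick a chunk size proportional to the table, clamped to the allowed range.
function chunkSizeFor(totalRows: number): number {
    const target = Math.ceil(totalRows / 100); // aim for roughly 100 chunks
    return Math.min(MAX_CHUNK_ROWS, Math.max(MIN_CHUNK_ROWS, target));
}
```

Small tables clamp to the 500-row floor, very large ones to the 5000-row ceiling, and mid-size tables scale in between; the breathing interval then runs between chunks so other queries can proceed.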
**2. Storage Flexibility**
- Automatic selection between R2 and DO storage based on dump size
- R2 for large dumps (>100MB)
- DO storage for smaller dumps
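The routing rule reduces to a single threshold check. `SIZE_THRESHOLD_FOR_R2` matches the constant discussed later in this thread; the helper around it is an illustrative sketch, not the actual code:

```typescript
// SIZE_THRESHOLD_FOR_R2 appears in the PR's chunked-dump.ts; the helper
// wrapping it here is an assumed sketch of the routing decision.
const SIZE_THRESHOLD_FOR_R2 = 100 * 1024 * 1024; // 100MB

type DumpStorage = "r2" | "do";

function selectStorage(estimatedSizeBytes: number): DumpStorage {
    // Large dumps go to R2; smaller ones stay in Durable Object storage.
    return estimatedSizeBytes > SIZE_THRESHOLD_FOR_R2 ? "r2" : "do";
}
```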
**3. Progress Tracking & Notifications**
- Unique progress tracking per dump
- Callback URL support for completion notifications
- Detailed status reporting
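The progress shape below matches the `/status` responses recorded in the test results; the payload builder is a hypothetical sketch of what gets POSTed to the `X-Callback-URL`, not the PR's exact code:

```typescript
// DumpProgress/DumpStatus mirror the /status JSON shown in this PR's test
// results; buildCallbackPayload is a hypothetical sketch of the callback body.
interface DumpProgress {
    currentTable: string;
    processedTables: number;
    totalTables: number;
}

interface DumpStatus {
    status: "in_progress" | "completed" | "failed";
    progress: DumpProgress;
}

// Serialize the final status for the callback POST, keyed by dump ID.
function buildCallbackPayload(dumpId: string, status: DumpStatus): string {
    return JSON.stringify({ dumpId, ...status });
}
```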
## Tasks
- [x] Implement chunked dump processing
- [x] Add dynamic chunk sizing (500-5000 rows based on table size)
- [x] Implement storage selection logic (R2/DO)
- [x] Add callback URL support
- [x] Add progress tracking with unique keys
- [x] Implement breathing intervals
- [x] Add error handling and logging
- [x] Add cleanup after successful download
- [x] Add test infrastructure
- [x] Verify functionality with test cases
## Verification Steps
- Start the server:
  ```bash
  npm run dev
  ```
- Start the callback server:
  ```bash
  cd test && node callback-server.js
  ```
- Test dump creation:
  ```bash
  curl -X POST http://127.0.0.1:8787/export/dump/chunked \
    -H "Authorization: Bearer ABC123" \
    -H "X-Callback-URL: http://localhost:3000/callback"
  ```
- Monitor status:
  ```bash
  curl http://127.0.0.1:8787/export/dump/{dumpId}/status \
    -H "Authorization: Bearer ABC123"
  ```
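For scripted verification, the monitor step can be sketched as a small polling helper (hypothetical code; `fetchStatus` is injected in place of the authenticated curl call so the loop can be exercised without a live server):

```typescript
// Hypothetical polling helper mirroring the status-check loop used in the
// test script; the response shape matches the /status JSON in this PR.
type StatusResponse = { result: { status: string } };

async function waitForDump(
    fetchStatus: () => Promise<StatusResponse>,
    maxChecks = 5,
    delayMs = 3000,
): Promise<string> {
    let status = "in_progress";
    for (let i = 0; i < maxChecks && status === "in_progress"; i++) {
        status = (await fetchStatus()).result.status;
        if (status === "in_progress") {
            await new Promise((resolve) => setTimeout(resolve, delayMs));
        }
    }
    return status; // "completed", "failed", or still "in_progress" on timeout
}
```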
## Before vs After
| Before | After |
|---|---|
| Database dumps loaded entire database into memory | Chunked processing prevents memory issues |
| No progress tracking | Real-time progress tracking |
| Single storage option | Automatic R2/DO storage selection |
| No callback notifications | Callback notifications |
| Risk of database locking | Breathing intervals prevent locking |
| 1GB DO storage limit | Support for databases >1GB |
| Basic error handling | Detailed error handling and logging |
## Test script
```bash
#!/bin/bash

# Configuration
BASE_URL="http://127.0.0.1:8787"
CALLBACK_URL="http://localhost:3000/callback"
AUTH_HEADER="Authorization: Bearer ABC123"
OUTPUT_FILE="dump-test-results.md"

# Check if the callback server is running
echo "Checking callback server..."
if ! curl -s "http://localhost:3000/callbacks" > /dev/null; then
    echo "Error: Callback server is not running. Please start it with 'npm run start-callback-server'"
    exit 1
fi

# Clear previous callbacks
curl -s -X DELETE "http://localhost:3000/callbacks" > /dev/null

# Start a fresh output file
cat > "$OUTPUT_FILE" << EOF
# Database Dump Test Results
Test conducted on: $(date)
## Test Steps
EOF

echo "Starting dump tests..."

# Step 1: Initiate a dump
echo "Step 1: Initiating dump..."
cat >> "$OUTPUT_FILE" << EOF
### Step 1: Initiate Dump
\`\`\`bash
curl -X POST "$BASE_URL/export/dump/chunked" \\
    -H "$AUTH_HEADER" \\
    -H "X-Callback-URL: $CALLBACK_URL"
\`\`\`
Response:
\`\`\`json
EOF
DUMP_RESPONSE=$(curl -s -X POST "$BASE_URL/export/dump/chunked" \
    -H "$AUTH_HEADER" \
    -H "X-Callback-URL: $CALLBACK_URL")
echo "$DUMP_RESPONSE" >> "$OUTPUT_FILE"
DUMP_ID=$(echo "$DUMP_RESPONSE" | jq -r '.result.dumpId')
cat >> "$OUTPUT_FILE" << EOF
\`\`\`
EOF

# Step 2: Check status (multiple times)
echo "Step 2: Checking status..."
cat >> "$OUTPUT_FILE" << EOF
### Step 2: Status Checks
EOF
for i in {1..5}; do
    echo "Status check $i..."
    cat >> "$OUTPUT_FILE" << EOF
#### Check $i
\`\`\`bash
curl "$BASE_URL/export/dump/$DUMP_ID/status" -H "$AUTH_HEADER"
\`\`\`
Response:
\`\`\`json
EOF
    STATUS_RESPONSE=$(curl -s "$BASE_URL/export/dump/$DUMP_ID/status" -H "$AUTH_HEADER")
    echo "$STATUS_RESPONSE" >> "$OUTPUT_FILE"
    cat >> "$OUTPUT_FILE" << EOF
\`\`\`
EOF
    STATUS=$(echo "$STATUS_RESPONSE" | jq -r '.result.status')
    if [ "$STATUS" = "completed" ] || [ "$STATUS" = "failed" ]; then
        echo "Dump $STATUS after $i checks"
        break
    fi
    sleep 3
done

# Step 3: Download the dump if completed
if [ "$STATUS" = "completed" ]; then
    echo "Step 3: Downloading dump..."
    cat >> "$OUTPUT_FILE" << EOF
### Step 3: Download Dump
\`\`\`bash
curl "$BASE_URL/export/dump/$DUMP_ID" -H "$AUTH_HEADER"
\`\`\`
Response:
\`\`\`
EOF
    DOWNLOAD_RESPONSE=$(curl -s "$BASE_URL/export/dump/$DUMP_ID" -H "$AUTH_HEADER" -w "\nHTTP Status: %{http_code}")
    echo "$DOWNLOAD_RESPONSE" >> "$OUTPUT_FILE"
    cat >> "$OUTPUT_FILE" << EOF
\`\`\`
EOF
fi

# Step 4: Check received callbacks
echo "Step 4: Checking callbacks..."
cat >> "$OUTPUT_FILE" << EOF
### Step 4: Received Callbacks
\`\`\`bash
curl "http://localhost:3000/callbacks"
\`\`\`
Response:
\`\`\`json
EOF
CALLBACKS_RESPONSE=$(curl -s "http://localhost:3000/callbacks")
echo "$CALLBACKS_RESPONSE" >> "$OUTPUT_FILE"
cat >> "$OUTPUT_FILE" << EOF
\`\`\`
EOF

# Add summary
cat >> "$OUTPUT_FILE" << EOF
## Summary
- Final Status: \`$STATUS\`
- Dump ID: \`$DUMP_ID\`
- Number of status checks: $i
- Callbacks Received: $(echo "$CALLBACKS_RESPONSE" | jq '. | length')
## System Information
- Test Time: $(date)
- OS: $(uname -a)
- Curl Version: $(curl --version | head -n 1)
EOF

echo "Test completed. Results saved to $OUTPUT_FILE"
```
## Test Results
Test conducted on: Sat Jan 25 13:10:23 WAT 2025
### Step 1: Initiate Dump
```bash
curl -X POST "http://127.0.0.1:8787/export/dump/chunked" \
    -H "Authorization: Bearer ABC123" \
    -H "X-Callback-URL: http://localhost:3000/callback"
```
Response:
```json
{
    "result": {
        "message": "Database dump started",
        "dumpId": "fb2e9497-d93a-457d-b96d-8cd1ae2d22fb",
        "status": "in_progress",
        "downloadUrl": "http://127.0.0.1:8787/export/dump/fb2e9497-d93a-457d-b96d-8cd1ae2d22fb",
        "estimatedSize": 3236
    }
}
```
### Step 2: Status Checks
Check 1:
```bash
curl "http://127.0.0.1:8787/export/dump/fb2e9497-d93a-457d-b96d-8cd1ae2d22fb/status" \
    -H "Authorization: Bearer ABC123"
```
Response:
```json
{
    "result": {
        "status": "in_progress",
        "progress": {
            "currentTable": "",
            "processedTables": 0,
            "totalTables": 6
        }
    }
}
```
Check 3 (final):
```json
{
    "result": {
        "status": "completed",
        "progress": {
            "currentTable": "users",
            "processedTables": 8,
            "totalTables": 6
        }
    }
}
```
### Step 3: Download Dump
Response excerpt:
```sql
SQLite format 3
-- Table: tmp_cache
CREATE TABLE tmp_cache (
    "id" INTEGER PRIMARY KEY AUTOINCREMENT,
    "timestamp" REAL NOT NULL,
    "ttl" INTEGER NOT NULL,
    "query" TEXT UNIQUE NOT NULL,
    "results" TEXT
);

-- Table: users
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
INSERT INTO users VALUES (1, 'John Doe', '[email protected]');
```
HTTP Status: 200
### Summary
- Final Status: `completed`
- Dump ID: `fb2e9497-d93a-457d-b96d-8cd1ae2d22fb`
- Number of status checks: 3
- Callbacks Received: 5
@onyedikachi-david Apologies for the delay here, will be testing this implementation this week and report back!
Okay.
```bash
curl --location 'https://starbasedb.{YOUR-IDENTIFIER}.workers.dev/export/dump' \
    --header 'Authorization: Bearer ABC123'
```
@onyedikachi-david Was looking at this and attempting to test but running into issues. To start testing, I changed this value to 0 so that a dump of any size would go to R2:
```typescript
// const SIZE_THRESHOLD_FOR_R2 = 100 * 1024 * 1024 // 100MB threshold for using R2 <- Commented this out
const SIZE_THRESHOLD_FOR_R2 = 0 // <- Testing with this
```
But when attempting to hit the cURL request above to begin the database dump I see the errors in my Cloudflare Worker log above. Do you have good steps to reproduce how you got the database dump to work as expected? I setup my R2 bucket successfully and linked it in the wrangler.toml as well.
@Brayden I noticed the issue might be related to the endpoint being used. I added a new endpoint for chunked dumps:
```bash
curl -X POST 'https://starbasedb.{YOUR-IDENTIFIER}.workers.dev/export/dump/chunked' \
    --header 'Authorization: Bearer ABC123' \
    --header 'X-Callback-URL: http://your-callback-url'
```
Would you retry with this? I don't have a paid Cloudflare instance; otherwise I would have tried replicating it on Cloudflare myself.
Thanks for the review; I've implemented some fixes. Could you please take another look?
@onyedikachi-david Now for all of my tests for some reason when I call to my /status endpoint I see the following response where processedTables never increments from 0.
```json
{
    "result": {
        "status": "in_progress",
        "progress": {
            "currentTable": "",
            "processedTables": 0,
            "totalTables": 9
        }
    }
}
```
It does seem to create an initial file in my R2 bucket with only the following contents:
```
SQLite format 3�
```
But no matter how long I wait, I'm not seeing the tables get processed. For what it's worth, when I test this PR there are two functions in your chunked-dump.ts file that require an `env` input that is always undefined, because the handler file does not have access to the environment either; instead I have to modify the code to accept `env.DATABASE_DUMPS` as an `R2Bucket` option and feed that in.
I know this was pretty close to working before so hoping we're not far off from it fully working? 🤞