
Feature/chunked database dumps

onyedikachi-david opened this pull request 11 months ago • 6 comments

Pull Request: Implement Chunked Database Dumps with Enhanced Features

/claim #59
Fixes: #59

Purpose

Implements robust chunked database dump functionality to handle large databases efficiently while preventing memory exhaustion and database locking. The implementation includes several key features:

1. Chunked Processing

  • Processes database dumps in manageable chunks
  • Dynamic chunk sizing based on table size (500-5000 rows)
  • Breathing intervals to prevent database locking
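
A minimal sketch of what this chunking loop could look like. Everything here is illustrative: the scaling heuristic, the 50ms pause, and the helper signatures are assumptions, not the PR's actual code.

```typescript
// Sketch only: pick a chunk size between 500 and 5000 rows based on
// how many rows the table holds, and pause briefly between chunks so
// concurrent queries are not locked out. Heuristic values are assumed.
const MIN_CHUNK = 500
const MAX_CHUNK = 5000
const BREATHING_MS = 50 // hypothetical pause between chunks

function chunkSizeFor(rowCount: number): number {
    // Larger tables get larger chunks, clamped to the 500-5000 range.
    return Math.min(MAX_CHUNK, Math.max(MIN_CHUNK, Math.floor(rowCount / 100)))
}

async function dumpTable(
    sql: (query: string) => Promise<Record<string, unknown>[]>,
    table: string,
    rowCount: number,
    write: (rows: Record<string, unknown>[]) => Promise<void>
): Promise<void> {
    const size = chunkSizeFor(rowCount)
    for (let offset = 0; offset < rowCount; offset += size) {
        // Table names come from the schema, not user input.
        const rows = await sql(`SELECT * FROM ${table} LIMIT ${size} OFFSET ${offset}`)
        await write(rows)
        // "Breathing interval": yield between chunks to avoid holding
        // the database lock for the entire dump.
        await new Promise((resolve) => setTimeout(resolve, BREATHING_MS))
    }
}
```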

2. Storage Flexibility

  • Automatic selection between R2 and DO storage based on dump size
  • R2 for large dumps (>100MB)
  • DO storage for smaller dumps
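
A sketch of the size-based storage split, assuming the ambient Cloudflare types from @cloudflare/workers-types. The key names are made up, and real Durable Object storage caps the size of a single value, so the actual implementation would need to split large values; this sketch elides that.

```typescript
// Sketch: route the finished dump to R2 or Durable Object storage
// based on its size. Key names are illustrative.
const SIZE_THRESHOLD_FOR_R2 = 100 * 1024 * 1024 // 100MB

async function storeDump(
    dumpId: string,
    data: ArrayBuffer,
    r2: R2Bucket,
    doStorage: DurableObjectStorage
): Promise<void> {
    if (data.byteLength > SIZE_THRESHOLD_FOR_R2) {
        // Large dumps go to R2, which sidesteps the 1GB DO storage limit.
        await r2.put(`dumps/${dumpId}.sql`, data)
    } else {
        // Smaller dumps stay in DO storage. In practice a dump may still
        // need to be split across multiple keys, since a single DO value
        // has a much smaller size cap; this sketch elides that.
        await doStorage.put(`dump:${dumpId}`, data)
    }
}
```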

3. Progress Tracking & Notifications

  • Unique progress tracking per dump
  • Callback URL support for completion notifications
  • Detailed status reporting
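
The progress fields below mirror the /status responses shown in the test results further down (flattened here for brevity; the real response nests them under progress). The storage mechanics and the callback POST body are assumptions.

```typescript
// Sketch of per-dump progress tracking plus the callback notification.
interface DumpProgress {
    status: 'in_progress' | 'completed' | 'failed'
    currentTable: string
    processedTables: number
    totalTables: number
}

async function notifyCallback(
    callbackUrl: string,
    dumpId: string,
    progress: DumpProgress
): Promise<void> {
    // POST a status update to the X-Callback-URL supplied by the client.
    await fetch(callbackUrl, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ dumpId, ...progress }),
    })
}
```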

Tasks

  • [x] Implement chunked dump processing
  • [x] Add dynamic chunk sizing (500-5000 rows based on table size)
  • [x] Implement storage selection logic (R2/DO)
  • [x] Add callback URL support
  • [x] Add progress tracking with unique keys
  • [x] Implement breathing intervals
  • [x] Add error handling and logging
  • [x] Add cleanup after successful download
  • [x] Add test infrastructure
  • [x] Verify functionality with test cases

Verification Steps

  1. Start the server:
     npm run dev
  2. Start the callback server:
     cd test && node callback-server.js
  3. Test dump creation:
     curl -X POST http://127.0.0.1:8787/export/dump/chunked \
       -H "Authorization: Bearer ABC123" \
       -H "X-Callback-URL: http://localhost:3000/callback"
  4. Monitor status:
     curl http://127.0.0.1:8787/export/dump/{dumpId}/status \
       -H "Authorization: Bearer ABC123"

Before vs After

| Before | After |
| --- | --- |
| Database dumps loaded entire database into memory | Chunked processing prevents memory issues |
| No progress tracking | Real-time progress tracking |
| Single storage option | Automatic R2/DO storage selection |
| No callback notifications | Callback notifications |
| Risk of database locking | Breathing intervals prevent locking |
| 1GB DO storage limit | Support for databases >1GB |
| Basic error handling | Detailed error handling and logging |
Test script
#!/bin/bash

# Configuration
BASE_URL="http://127.0.0.1:8787"
CALLBACK_URL="http://localhost:3000/callback"
AUTH_HEADER="Authorization: Bearer ABC123"
OUTPUT_FILE="dump-test-results.md"

# Check if callback server is running
echo "Checking callback server..."
if ! curl -s "http://localhost:3000/callbacks" > /dev/null; then
    echo "Error: Callback server is not running. Please start it with 'npm run start-callback-server'"
    exit 1
fi

# Clear previous callbacks
curl -s -X DELETE "http://localhost:3000/callbacks" > /dev/null

# Start fresh output file
cat > "$OUTPUT_FILE" << EOF
# Database Dump Test Results
Test conducted on: $(date)

## Test Steps

EOF

echo "Starting dump tests..."

# Step 1: Initiate a dump
echo "Step 1: Initiating dump..."
cat >> "$OUTPUT_FILE" << EOF
### Step 1: Initiate Dump
\`\`\`bash
curl -X POST "$BASE_URL/export/dump/chunked" \\
    -H "$AUTH_HEADER" \\
    -H "X-Callback-URL: $CALLBACK_URL"
\`\`\`

Response:
\`\`\`json
EOF

DUMP_RESPONSE=$(curl -s -X POST "$BASE_URL/export/dump/chunked" \
    -H "$AUTH_HEADER" \
    -H "X-Callback-URL: $CALLBACK_URL")
echo "$DUMP_RESPONSE" >> "$OUTPUT_FILE"
DUMP_ID=$(echo "$DUMP_RESPONSE" | jq -r '.result.dumpId')

cat >> "$OUTPUT_FILE" << EOF
\`\`\`

EOF

# Step 2: Check status (multiple times)
echo "Step 2: Checking status..."
cat >> "$OUTPUT_FILE" << EOF
### Step 2: Status Checks

EOF

for i in {1..5}; do
    echo "Status check $i..."
    cat >> "$OUTPUT_FILE" << EOF
#### Check $i
\`\`\`bash
curl "$BASE_URL/export/dump/$DUMP_ID/status" -H "$AUTH_HEADER"
\`\`\`

Response:
\`\`\`json
EOF

    STATUS_RESPONSE=$(curl -s "$BASE_URL/export/dump/$DUMP_ID/status" -H "$AUTH_HEADER")
    echo "$STATUS_RESPONSE" >> "$OUTPUT_FILE"
    
    cat >> "$OUTPUT_FILE" << EOF
\`\`\`

EOF
    
    STATUS=$(echo "$STATUS_RESPONSE" | jq -r '.result.status')
    if [ "$STATUS" = "completed" ] || [ "$STATUS" = "failed" ]; then
        echo "Dump $STATUS after $i checks"
        break
    fi
    sleep 3
done

# Step 3: Download the dump if completed
if [ "$STATUS" = "completed" ]; then
    echo "Step 3: Downloading dump..."
    cat >> "$OUTPUT_FILE" << EOF
### Step 3: Download Dump
\`\`\`bash
curl "$BASE_URL/export/dump/$DUMP_ID" -H "$AUTH_HEADER"
\`\`\`

Response:
\`\`\`
EOF

    DOWNLOAD_RESPONSE=$(curl -s "$BASE_URL/export/dump/$DUMP_ID" -H "$AUTH_HEADER" -w "\nHTTP Status: %{http_code}")
    echo "$DOWNLOAD_RESPONSE" >> "$OUTPUT_FILE"

    cat >> "$OUTPUT_FILE" << EOF
\`\`\`

EOF
fi

# Step 4: Check received callbacks
echo "Step 4: Checking callbacks..."
cat >> "$OUTPUT_FILE" << EOF
### Step 4: Received Callbacks
\`\`\`bash
curl "http://localhost:3000/callbacks"
\`\`\`

Response:
\`\`\`json
EOF

CALLBACKS_RESPONSE=$(curl -s "http://localhost:3000/callbacks")
echo "$CALLBACKS_RESPONSE" >> "$OUTPUT_FILE"

cat >> "$OUTPUT_FILE" << EOF
\`\`\`

EOF

# Add summary
cat >> "$OUTPUT_FILE" << EOF
## Summary
- Final Status: \`$STATUS\`
- Dump ID: \`$DUMP_ID\`
- Number of status checks: $i
- Callbacks Received: $(echo "$CALLBACKS_RESPONSE" | jq '. | length')

## System Information
- Test Time: $(date)
- OS: $(uname -a)
- Curl Version: $(curl --version | head -n 1)
EOF

echo "Test completed. Results saved to $OUTPUT_FILE" 
Test Results

Database Dump Test Results

Test conducted on: Sat Jan 25 13:10:23 WAT 2025

Test Steps

Step 1: Initiate Dump

curl -X POST "http://127.0.0.1:8787/export/dump/chunked" \
    -H "Authorization: Bearer ABC123" \
    -H "X-Callback-URL: http://localhost:3000/callback"

Response:

{
  "result": {
    "message": "Database dump started",
    "dumpId": "fb2e9497-d93a-457d-b96d-8cd1ae2d22fb",
    "status": "in_progress",
    "downloadUrl": "http://127.0.0.1:8787/export/dump/fb2e9497-d93a-457d-b96d-8cd1ae2d22fb",
    "estimatedSize": 3236
  }
}

Step 2: Status Checks

Check 1:

curl "http://127.0.0.1:8787/export/dump/fb2e9497-d93a-457d-b96d-8cd1ae2d22fb/status" \
  -H "Authorization: Bearer ABC123"

Response:

{
  "result": {
    "status": "in_progress",
    "progress": {
      "currentTable": "",
      "processedTables": 0,
      "totalTables": 6
    }
  }
}

Check 3 (Final):

{
  "result": {
    "status": "completed",
    "progress": {
      "currentTable": "users",
      "processedTables": 8,
      "totalTables": 6
    }
  }
}

Step 3: Download Dump

Response Excerpt:

SQLite format 3
-- Table: tmp_cache
CREATE TABLE tmp_cache (
    "id" INTEGER PRIMARY KEY AUTOINCREMENT,
    "timestamp" REAL NOT NULL,
    "ttl" INTEGER NOT NULL,
    "query" TEXT UNIQUE NOT NULL,
    "results" TEXT
);

-- Table: users
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
INSERT INTO users VALUES (1, 'John Doe', '[email protected]');

HTTP Status: 200

Summary

  • Final Status: completed
  • Dump ID: fb2e9497-d93a-457d-b96d-8cd1ae2d22fb
  • Number of status checks: 3
  • Callbacks Received: 5

onyedikachi-david • Jan 25, 2025

@onyedikachi-david Apologies for the delay here, will be testing this implementation this week and report back!

Brayden • Feb 3, 2025

> @onyedikachi-david Apologies for the delay here, will be testing this implementation this week and report back!

Okay.

onyedikachi-david • Feb 4, 2025

[Screenshot: Cloudflare Worker error logs]

curl --location 'https://starbasedb.{YOUR-IDENTIFIER}.workers.dev/export/dump' \
--header 'Authorization: Bearer ABC123'

@onyedikachi-david Was looking at this and attempting to test, but I'm running into issues. To start testing, I changed this value to 0 so that a dump of any size would go to R2:

// const SIZE_THRESHOLD_FOR_R2 = 100 * 1024 * 1024 // 100MB threshold for using R2 <- Commented this out
const SIZE_THRESHOLD_FOR_R2 = 0 // <- Testing with this

But when hitting the cURL request above to begin the database dump, I see the errors shown in my Cloudflare Worker log above. Do you have good steps to reproduce how you got the database dump to work as expected? I set up my R2 bucket successfully and linked it in the wrangler.toml as well.

Brayden • Feb 9, 2025

@Brayden I noticed the issue might be related to the endpoint being used. I added a new endpoint for chunked dumps:

curl -X POST 'https://starbasedb.{YOUR-IDENTIFIER}.workers.dev/export/dump/chunked' \
--header 'Authorization: Bearer ABC123' \
--header 'X-Callback-URL: http://your-callback-url'

Would you retry with this? I don't have a paid Cloudflare instance, otherwise I would have tried replicating it on Cloudflare myself.

onyedikachi-david • Feb 11, 2025

Thanks for the review; I implemented some fixes. Could you please take another look?

onyedikachi-david • Feb 13, 2025

@onyedikachi-david Now, in all of my tests, when I call the /status endpoint I see the following response, where processedTables never increments from 0.

{
    "result": {
        "status": "in_progress",
        "progress": {
            "currentTable": "",
            "processedTables": 0,
            "totalTables": 9
        }
    }
}

It does seem to create an initial file in my R2 bucket with only the following contents:

SQLite format 3\0

But no matter how long I wait, I'm not seeing the tables get processed. For what it's worth, when I go to test this PR, there are two functions in your chunked-dump.ts file that require an env input which is always undefined, because the handler file does not have access to the environment either; instead, I have to modify the code to accept env.DATABASE_DUMPS as an R2Bucket option and feed that in.
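
A sketch of the workaround described here: thread the R2 binding from the Worker's env into the dump code explicitly instead of having chunked-dump.ts read an env it never receives. startChunkedDump is an illustrative name, not an identifier from the PR, and the R2Bucket type is assumed to come from @cloudflare/workers-types.

```typescript
interface Env {
    DATABASE_DUMPS: R2Bucket // R2 binding declared in wrangler.toml
}

async function startChunkedDump(
    request: Request,
    bucket: R2Bucket
): Promise<Response> {
    // ...kick off the chunked dump, writing each chunk to `bucket`...
    return new Response(JSON.stringify({ status: 'in_progress' }), {
        headers: { 'Content-Type': 'application/json' },
    })
}

export default {
    async fetch(request: Request, env: Env): Promise<Response> {
        const { pathname } = new URL(request.url)
        if (request.method === 'POST' && pathname === '/export/dump/chunked') {
            // Pass the bucket in directly so the dump code never needs env.
            return startChunkedDump(request, env.DATABASE_DUMPS)
        }
        return new Response('Not found', { status: 404 })
    },
}
```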

I know this was pretty close to working before, so I'm hoping we're not far off from it fully working? 🤞

Brayden avatar Feb 17 '25 14:02 Brayden