
Database dumps do not work on large databases

Open · Brayden opened this issue 11 months ago · 11 comments

Describe the bug
If you try to use any of the database dump endpoints (SQL, CSV, or JSON), the entire dataset is loaded into memory and then written out as a dump file. We should investigate enhancements so that a database of any size can be exported. The current size limitation for Durable Objects is 1GB, with 10GB coming in the future, so operate under the assumption that we might be attempting to dump a 10GB database into a .sql file.

Another consideration: because Durable Objects execute operations synchronously, we may need to allow for "breathing intervals". For example, we could let the export operation run for 5 seconds, then back off for 5 seconds if other requests are queued, and then pick up again. The goal here is to prevent locking the database for long periods of time.

This then poses the questions:

  1. How do we continue operations that need more than 30 seconds to work?
  2. Where is the data stored as it's being created? (R2, S3, something else)?
  3. How do we deliver that dump information to the user after it's completed?

To Reproduce
Steps to reproduce the behavior:

  1. Hit the /export/dump endpoint on a large database
  2. The request will eventually fail when the 30-second response window closes

Run the following command in Terminal (replace the URL with yours). If your operation exceeds 30 seconds, you should see a failed network response instead of a dump file.

curl --location 'https://starbasedb.YOUR-ID-HERE.workers.dev/export/dump' \
--header 'Authorization: Bearer ABC123' \
--output database_dump.sql

If you can't create a large enough test database, feel free to add code that sleeps for 29 seconds before proceeding with the /export/dump functional code; you should also see the failure.
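One way to add that artificial delay, purely for local testing (where exactly it sits inside the /export/dump handler is an assumption):

// TypeScript, at the top of the /export/dump handler, before the real dump logic runs
await new Promise((resolve) => setTimeout(resolve, 29_000)) // simulate an export that exceeds the 30s window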

Expected behavior
As a user, I would expect all of the specified data to be dumped out without an error and without partial results. Where the dump ends up for the user to access, if the operation takes more than 30 seconds, is up for discussion. Ideally, if it takes less than 30 seconds it could be returned the way our cURL above works today (downloading the file from the response of the origin request), but perhaps after the timeout it continues on and uploads it to a destination the user can access afterwards?

Proposed Solution:

  1. For backups, require an R2 binding
  2. Have a .sql file created in R2 with a filename like dump_20240101-170000.sql, where the timestamp represents 2024-01-01 17:00:00
  3. Create the file and continuously append new chunks to it until reaching the end of the data
  4. We may need to utilize a DO alarm to continue the work after X amount of time if a timeout occurs, and mark where the process currently is in internal memory so it can pick up and continue (see the sketch below)
  5. Provide a callback URL that is invoked when the operation finally completes, so users can build custom notification logic (e.g. Email, Slack, etc.)
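To make items 2-4 more concrete, here is a rough TypeScript sketch of a chunked dump driven by a DO alarm with a 5-second work budget per slice. R2 does not support appending to an existing object, so this sketch writes each slice as its own object under a shared prefix (a multipart upload would be the other option). Everything here is illustrative: the DUMP_BUCKET binding, the DumpCoordinator class, the state shape, the constants, and the rowsToInsertStatements helper are assumptions rather than existing StarbaseDB code, and the routing/auth glue around startDump() is omitted.

// All names below (DUMP_BUCKET, DumpCoordinator, DumpState, etc.) are hypothetical.
interface Env {
  DUMP_BUCKET: R2Bucket
}

interface DumpState {
  prefix: string      // e.g. dump_20240101-170000 ; slices land under this R2 prefix
  tables: string[]
  tableIndex: number  // which table we are currently exporting
  lastRowId: number   // resume point within that table
  chunkIndex: number  // next slice number to write
}

const TIME_BUDGET_MS = 5_000      // "breathing interval": work ~5s per alarm, then yield
const CHUNK_ROWS = 1_000          // rows pulled per query
const MAX_SLICE_BYTES = 8_000_000 // keep each R2 slice object a manageable size

export class DumpCoordinator {
  constructor(private state: DurableObjectState, private env: Env) {}

  async startDump(): Promise<string> {
    const iso = new Date().toISOString() // e.g. 2024-01-01T17:00:00.000Z
    const prefix = `dump_${iso.slice(0, 10).replace(/-/g, '')}-${iso.slice(11, 19).replace(/:/g, '')}`
    const tables = this.state.storage.sql
      .exec(`SELECT name FROM sqlite_master WHERE type='table'`)
      .toArray()
      .map((row) => row.name as string)
    await this.state.storage.put<DumpState>('dump', {
      prefix, tables, tableIndex: 0, lastRowId: 0, chunkIndex: 0,
    })
    await this.state.storage.setAlarm(Date.now()) // kick off the first work slice
    return prefix
  }

  // Each alarm invocation does a bounded amount of work, persists its position, and re-arms.
  async alarm() {
    const dump = await this.state.storage.get<DumpState>('dump')
    if (!dump) return
    const deadline = Date.now() + TIME_BUDGET_MS
    let sql = ''

    while (Date.now() < deadline && sql.length < MAX_SLICE_BYTES && dump.tableIndex < dump.tables.length) {
      const table = dump.tables[dump.tableIndex]
      const rows = this.state.storage.sql
        .exec(`SELECT rowid, * FROM "${table}" WHERE rowid > ? ORDER BY rowid LIMIT ?`,
              dump.lastRowId, CHUNK_ROWS)
        .toArray()
      if (rows.length === 0) { dump.tableIndex++; dump.lastRowId = 0; continue }
      dump.lastRowId = rows[rows.length - 1].rowid as number
      sql += rowsToInsertStatements(table, rows)
    }

    if (sql.length > 0) {
      // Each slice becomes its own object; the download route (or a final stitching pass)
      // can concatenate dump_.../part-*.sql in order.
      const name = `${dump.prefix}/part-${String(dump.chunkIndex++).padStart(5, '0')}.sql`
      await this.env.DUMP_BUCKET.put(name, sql)
    }

    if (dump.tableIndex >= dump.tables.length) {
      await this.env.DUMP_BUCKET.put(`${dump.prefix}/complete`, '') // marker object
      await this.state.storage.delete('dump')
      // A user-supplied callback URL (item 5) could be fetched here to signal completion.
    } else {
      await this.state.storage.put('dump', dump)
      await this.state.storage.setAlarm(Date.now() + TIME_BUDGET_MS) // pause so queued requests get a turn
    }
  }
}

// Naive INSERT serializer for illustration; real code needs proper SQL escaping and type handling.
function rowsToInsertStatements(table: string, rows: Record<string, unknown>[]): string {
  return rows
    .map(({ rowid, ...cols }) => {
      const names = Object.keys(cols).map((c) => `"${c}"`).join(', ')
      const values = Object.values(cols)
        .map((v) => (v === null ? 'NULL' : typeof v === 'number' ? String(v) : `'${String(v).replace(/'/g, "''")}'`))
        .join(', ')
      return `INSERT INTO "${table}" (${names}) VALUES (${values});\n`
    })
    .join('')
}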

Brayden · Dec 29 '24 14:12

/bounty $250

Brayden · Jan 24 '25 17:01

💎 $250 bounty • Outerbase (YC W23)

Steps to solve:

  1. Start working: Comment /attempt #59 with your implementation plan
  2. Submit work: Create a pull request including /claim #59 in the PR body to claim the bounty
  3. Receive payment: 100% of the bounty is received 2-5 days post-reward. Make sure you are eligible for payouts

Thank you for contributing to outerbase/starbasedb!


Attempt Started (GMT+0) Solution
🟢 @onyedikachi-david Jan 25, 2025, 5:26:58 AM #76
🟢 @BenraouaneSoufiane Jan 27, 2025, 3:54:09 PM WIP
🟢 @Kunal-Darekar Mar 6, 2025, 6:11:41 AM #93

algora-pbc[bot] · Jan 24 '25 17:01

Hi @Brayden, instead of using the DO alarm API, could we use Queues when the database is greater than a certain size (say 200 or 500)?

My suggestion is to attach Cloudflare Queues to new private endpoints, where the consumer and producer are a single Worker that exports the data into R2 for a certain time, say 20 to 30 seconds, depending on usage limits. This way the queue can keep track of the last record exported, and no request/response gets blocked.

We could depend on the DO's databaseSize property (https://developers.cloudflare.com/durable-objects/api/sql-storage/#databasesize) to trigger the queue, or something along those lines.
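For discussion, here is a minimal sketch (not StarbaseDB code) of the producer/consumer-in-one-Worker idea with Cloudflare Queues, where each delivery exports one time-boxed slice and re-enqueues a continuation message carrying a cursor. The EXPORT_QUEUE binding, the STARBASE_DO namespace, the ExportJob shape, and the /export-slice DO route are all hypothetical names.

// Illustrative only: bindings, message shape, and the DO route are assumptions.
interface ExportJob {
  databaseId: string
  cursor?: string      // where the previous slice stopped (e.g. table + last rowid)
  dumpKey: string      // R2 object the slices are written against
}

interface Env {
  EXPORT_QUEUE: Queue<ExportJob>
  STARBASE_DO: DurableObjectNamespace
}

export default {
  // Producer: the export endpoint enqueues a job instead of dumping inline.
  async fetch(_req: Request, env: Env): Promise<Response> {
    const job: ExportJob = {
      databaseId: 'default',
      dumpKey: `dump_${Date.now()}.sql`,
    }
    await env.EXPORT_QUEUE.send(job)
    return Response.json({ status: 'queued', dumpKey: job.dumpKey }, { status: 202 })
  },

  // Consumer: each delivery exports one time-boxed slice, then re-enqueues the rest.
  async queue(batch: MessageBatch<ExportJob>, env: Env): Promise<void> {
    for (const msg of batch.messages) {
      const stub = env.STARBASE_DO.get(env.STARBASE_DO.idFromName(msg.body.databaseId))
      // Hypothetical DO endpoint that exports for ~20-30s and reports where it stopped.
      const res = await stub.fetch('https://do/export-slice', {
        method: 'POST',
        body: JSON.stringify(msg.body),
      })
      const { done, cursor } = (await res.json()) as { done: boolean; cursor?: string }
      if (!done) {
        await env.EXPORT_QUEUE.send({ ...msg.body, cursor })
      }
      msg.ack()
    }
  },
} satisfies ExportedHandler<Env, ExportJob>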

b4s36t4 · Jan 24 '25 19:01

Hey @b4s36t4!

Instead of using the DO alarm API, I think queues are also an acceptable approach. Just to talk through it a bit more so I fully understand how you're thinking of approaching it, a couple of questions:

  • When you mention attaching to new private endpoints, would this be a new Worker altogether? Could it instead be a Starbase plugin perhaps? Here's a blog outlining how they work and some examples are also included in the repo (https://www.outerbase.com/blog/extensibility-with-plugins/)
  • When you say we can base it on whether the database size is greater than some defined number (e.g. 200 or 500), would we fetch the database size at some interval, check whether the growth since the last export meets the criteria, and auto-run the backup?
  • Could we allow users to define, say, "backup daily" or "backup weekly" and trigger the export at that time? (A cron-trigger sketch follows below.)

Thanks for looking into this!
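For reference, the daily/weekly idea maps naturally onto Cron Triggers; a minimal sketch, assuming the same hypothetical EXPORT_QUEUE binding and job shape as in the queue sketch above:

// wrangler.toml (assumed):
//   [triggers]
//   crons = ["0 3 * * *", "0 3 * * 0"]   # daily at 03:00 UTC, plus Sundays for a weekly policy

interface Env {
  EXPORT_QUEUE: Queue<{ databaseId: string; dumpKey: string }>
}

export default {
  async scheduled(controller: ScheduledController, env: Env, ctx: ExecutionContext): Promise<void> {
    // controller.cron identifies which schedule fired, so daily and weekly policies can coexist.
    const policy = controller.cron === '0 3 * * 0' ? 'weekly' : 'daily'
    ctx.waitUntil(
      env.EXPORT_QUEUE.send({
        databaseId: 'default',
        dumpKey: `${policy}_dump_${new Date().toISOString().slice(0, 10)}.sql`,
      })
    )
  },
} satisfies ExportedHandler<Env>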

Brayden · Jan 24 '25 21:01

/attempt #59


onyedikachi-david · Jan 25 '25 05:01

💡 @onyedikachi-david submitted a pull request that claims the bounty. You can visit your bounty board to reward.

algora-pbc[bot] · Jan 25 '25 12:01

/attempt #59 May I ask you for a Cloudflare paid account, because I don't have one? (It may be resolvable without a Cloudflare account by following your guidance.)

BenraouaneSoufiane · Jan 27 '25 15:01

/cancel #59

BenraouaneSoufiane · Feb 02 '25 14:02

/attempt #93

Hey @Brayden! Implementation plan for the database dump feature:

  1. Core Implementation:

    • Create chunked processing system for database dumps
    • Implement progress tracking mechanism
    • Add R2 storage integration for dump files
    • Build callback notification system
  2. Key Functions to Implement:

    • startChunkedDumpRoute(): Initialize dump process
    • processDumpChunk(): Handle data chunks
    • getDumpFileRoute(): Retrieve dump files
    • finalizeDump(): Complete dump process
  3. Testing Strategy:

    • Unit tests for all core functions
    • Integration tests for the complete flow
    • Edge case handling (empty DBs, large datasets)
    • Error scenarios and recovery
  4. Features to Cover:

    • Large database support through chunking
    • Progress monitoring
    • SQL escaping
    • Callback notifications
    • R2 storage integration
  5. Timeline:

    • Core implementation: Completed
    • Testing: Completed
    • Documentation: Completed
    • PR ready for review

Status: Implementation complete and ready for review. All tests passing.
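For reference while reviewing, here is a rough sketch of how the first and third routes named in the plan might be shaped. The request/response formats, URL paths, and the DUMP_BUCKET binding are assumptions for illustration, not the actual PR code.

// Hypothetical route shapes only; the actual contracts live in the PR.
interface Env {
  DUMP_BUCKET: R2Bucket
}

// POST /export/dump/start -> kicks off a chunked dump and returns its identifier.
export async function startChunkedDumpRoute(_req: Request, env: Env): Promise<Response> {
  const dumpId = crypto.randomUUID()
  await env.DUMP_BUCKET.put(`${dumpId}.status`, 'processing') // placeholder until chunks finish
  // ...initialize dump state and schedule the first chunk here...
  return Response.json({ dumpId, status: 'processing' }, { status: 202 })
}

// GET /export/dump/:dumpId -> streams the finished file from R2, or reports progress.
export async function getDumpFileRoute(_req: Request, env: Env, dumpId: string): Promise<Response> {
  const object = await env.DUMP_BUCKET.get(`${dumpId}.sql`)
  if (!object) return Response.json({ status: 'processing' }, { status: 202 })
  return new Response(object.body, {
    headers: {
      'Content-Type': 'application/sql',
      'Content-Disposition': `attachment; filename="${dumpId}.sql"`,
    },
  })
}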



Kunal-Darekar · Mar 06 '25 06:03

💡 @Kunal-Darekar submitted a pull request that claims the bounty. You can visit your bounty board to reward.

algora-pbc[bot] · Mar 07 '25 20:03

@Brayden Apologies, I mistakenly deleted my original PR while making some updates. I've created a new PR #93 and included the necessary documentation, including the wrangler.toml variables and the cURL command/endpoint for triggering. Everything should be clear and ready for testing. Could you please check? Thanks for your patience!

Kunal-Darekar · Mar 16 '25 19:03