
Graceful recovery from data loss

Open derekperkins opened this issue 8 months ago • 5 comments

While I was working on a server component, the data was all in memory, so server restarts wiped everything. Using the default UI athletes example, it never recovers. After a restart, all requests fail with an error like this:

Error: Query failed with HTTP status 500: query: failed to execute query: Catalog Error: Table with name athletes does not exist! Did you mean "pg_tables"?

LINE 1: SELECT "weight", "height", "sex" FROM "athletes" AS "source" WHERE (("name" IS NOT DISTINCT FROM... ^

at RestConnector.query (rest.ts:49:13) at async QueryManager.submit (QueryManager.ts:104:20)

The coordinator should be able to detect that class of error, recreate the table, then resend the original query, so the user doesn't even realize that a recovery has happened other than maybe some additional latency.

This is relevant not only for any server implementation, where something could cause an issue, but also for the client: as support for OPFS and client-side persistence gets added, tables could get cleared out from underneath the application.

I'm not sure whether we'd need to standardize error responses or find some other way to reliably let the coordinator detect it.

derekperkins avatar Jun 28 '25 15:06 derekperkins

I like the idea of catching these kinds of errors (although I haven't come across this myself yet). There are different ways we could implement this. Do you have a suggestion?

domoritz avatar Jun 28 '25 18:06 domoritz

Nothing comes to mind. I don't have a ton of DuckDB experience, but I haven't seen any error codes like I'd expect from MySQL or other databases I've worked with. Searching the DuckDB code for Catalog Errors, I don't see any structured errors. Maybe there isn't a better option than string matching on the error message.

I looked at the Rust and Go implementations to see how they're interpreting errors, but I don't know that there's much there to use in this situation.

  • https://docs.rs/duckdb/latest/duckdb/enum.Error.html
  • https://github.com/marcboeker/go-duckdb/blob/main/errors.go
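A minimal sketch of that string-matching fallback, in TypeScript since the stack trace above points at `rest.ts` / `QueryManager.ts`. The regex and the `missingTable` helper are hypothetical, not part of Mosaic or DuckDB:

```typescript
// Hypothetical sketch: classify a DuckDB failure as a missing-table
// "Catalog Error" by string matching, since DuckDB does not expose
// structured error codes over the REST connector.
const CATALOG_ERROR = /Catalog Error: Table with name (\S+) does not exist/;

// Returns the missing table's name, or null if this is some other error.
function missingTable(message: string): string | null {
  const match = CATALOG_ERROR.exec(message);
  return match ? match[1] : null;
}

// Against the error text from this issue:
missingTable(
  'query: failed to execute query: Catalog Error: ' +
  'Table with name athletes does not exist! Did you mean "pg_tables"?'
); // → "athletes"
```

String matching is brittle across DuckDB versions, but absent structured errors it may be the only detection hook available.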

derekperkins avatar Jun 28 '25 18:06 derekperkins

If I understand it correctly, the issue is that the database crashes and is not persistent (e.g., no disk-based store), and so the loaded tables are lost. In this case Mosaic isn't really set up (at present, anyway) to help. Table creation queries are passed through and otherwise forgotten (and in some cases the tables may be entirely created outside of Mosaic calls!), so Mosaic Core doesn't have the information necessary to re-load the tables.

Similarly, without such information we can't recognize what class of error has occurred. Did the database crash and need reloading? Or is a client simply buggy and asking for the wrong table name?

We would need to add some kind of additional abstraction that tracks the table definitions / creation queries and can recognize when they need to be re-initialized. I think this might be achieved in a way that is largely separable from the existing API -- with the one exception of catching errors (query failures) for analysis. If someone wants to experiment with solutions here, I'm happy to advise, but don't have time to work on this myself at the moment.
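One shape such an abstraction could take is a small registry that records table-creation SQL so it can be replayed after the database loses state. This is a sketch only; `TableRegistry` and `QueryFn` are hypothetical names, not existing Mosaic API:

```typescript
// Hypothetical sketch: remember each table's CREATE statement so the
// coordinator (or a wrapper around it) can re-initialize lost tables.
type QueryFn = (sql: string) => Promise<unknown>;

class TableRegistry {
  private creators = new Map<string, string>();

  // Record the CREATE statement for a table so it can be replayed later.
  register(table: string, createSQL: string): void {
    this.creators.set(table, createSQL);
  }

  // Replay the recorded CREATE statement against the database, if any.
  // Returns false when the table was created outside our knowledge.
  async restore(table: string, query: QueryFn): Promise<boolean> {
    const sql = this.creators.get(table);
    if (!sql) return false;
    await query(sql);
    return true;
  }
}
```

The `restore` failure case makes the limitation above concrete: tables created entirely outside Mosaic calls can't be recovered this way.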

jheer avatar Jun 28 '25 21:06 jheer

I'm architecting how to go into production with this. We're running in Kubernetes, and my thought is that we'll run on an ephemeral local SSD for performance reasons, versus running on a more persistent network disk. If the container restarts or has to be rescheduled to another node, the disk is wiped. Reloading from scratch shouldn't be an issue; it's just a question of where that logic should live.

Table creation queries are passed through and otherwise forgotten

Ah, I didn't realize it was just passthrough, I thought the coordinator had more state information. Presumably the client would get back the missing table error, and could reconstitute the necessary state by resending the initial query.
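That client-side retry flow could look something like this sketch. Everything here is hypothetical (`withRecovery`, the caller-supplied `reinitialize`); the only grounded piece is detecting the Catalog Error text from the issue above:

```typescript
// Hypothetical sketch: run a query; if it fails with a missing-table
// Catalog Error, re-run a caller-supplied initializer (e.g. the original
// data-loading query) and retry the query once.
type Query<T> = () => Promise<T>;

async function withRecovery<T>(
  run: Query<T>,
  reinitialize: () => Promise<void>
): Promise<T> {
  try {
    return await run();
  } catch (err) {
    const msg = err instanceof Error ? err.message : String(err);
    if (!/Catalog Error: Table with name \S+ does not exist/.test(msg)) {
      throw err; // not the recoverable class of error
    }
    await reinitialize(); // reconstitute the table, then retry once
    return run();
  }
}
```

Keeping this outside the coordinator sidesteps the state-tracking question at the cost of every client having to know how to reconstitute its own tables.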

If someone wants to experiment with solutions here, I'm happy to advise, but don't have time to work on this myself at the moment.

I'll keep an eye on this as we continue testing things out, and experiment with options both inside and outside of mosaic.

derekperkins avatar Jun 28 '25 23:06 derekperkins

in some cases the tables may be entirely created outside of Mosaic calls

Is there a suggested lifecycle for this behavior? We have PB-scale data in BigQuery, but as with most data, only a small subset of it is typically in use. The plan is to ingest that subset into DuckDB at customer login (or whatever the trigger is). Assuming the data isn't available natively via a DuckDB connector, we'd have to bypass vg.coordinator().exec() to make sure the DuckDB table exists before Mosaic takes over. Is there any downside to that versus having the coordinator oversee it? If there is value in the coordinator knowing about it, maybe there's room for a companion to exec that ships arbitrary commands through the same API channel, with a name and whatever arbitrary data the server requires:

await vg.coordinator().command('populateFromBigQuery', {arbitrary: "user", provided: "values"});

Maybe there isn't value in the coordinator knowing, which is also fine. I appreciate your feedback so far as I probe the edges of the system to better understand where to draw the boundary between what Mosaic is responsible for and what it isn't.

Table creation queries are passed through and otherwise forgotten (and in some cases the tables may be entirely created outside of Mosaic calls!), so Mosaic Core doesn't have the information necessary to re-load the tables

I understand where my misconception came from now. I was mistaking the declarative Mosaic Spec for the entirety of the Mosaic flow, when it's really just a transform into the more imperative flow of the coordinator.

derekperkins avatar Jun 29 '25 02:06 derekperkins