Leverage S3 conditional writes for a concurrent S3 metastore

Open trinity-1686a opened this issue 6 months ago • 3 comments

Is your feature request related to a problem? Please describe.

Today, the only way to have an HA metastore is to deploy a Postgres cluster (ideally set up with redundancy itself). This is a pain point: people would often prefer the simpler-to-set-up S3 metastore if it provided the same level of availability.

Describe the solution you'd like

Provide a configuration option that allows running several S3 metastores concurrently. This can be achieved by leveraging S3 conditional writes. Since not all alternative S3 providers support them, this should be optional, and we should, if possible, detect when the option is enabled but the provider doesn't support it (for instance, write a random object, try to overwrite it with a conditional write that should fail, and verify that it does indeed fail).
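The detection step described above could be sketched roughly as follows. This is only an illustration: `InMemoryStore`, its `put`/`delete` methods and the `if_none_match` flag are hypothetical stand-ins for an object-store client, not Quickwit or AWS SDK APIs. Against real S3, the condition would presumably map to the `If-None-Match: *` header on a PUT.

```python
import uuid


class PreconditionFailed(Exception):
    """Raised when the store rejects a conditional write."""


class InMemoryStore:
    """Hypothetical stand-in for an object-store client (not a real SDK)."""

    def __init__(self, honors_conditions: bool):
        self.honors_conditions = honors_conditions
        self.objects = {}

    def put(self, key: str, body: bytes, if_none_match: bool = False) -> None:
        # A compliant store refuses the write when the key already exists.
        if if_none_match and self.honors_conditions and key in self.objects:
            raise PreconditionFailed(key)
        self.objects[key] = body

    def delete(self, key: str) -> None:
        self.objects.pop(key, None)


def supports_conditional_writes(store) -> bool:
    """Write a random object, then attempt a conditional overwrite that must fail."""
    key = f"conditional-write-probe/{uuid.uuid4()}"
    store.put(key, b"first")
    try:
        store.put(key, b"second", if_none_match=True)
    except PreconditionFailed:
        return True  # the overwrite was refused, as expected
    finally:
        store.delete(key)  # always clean up the probe object
    return False  # the overwrite silently succeeded: conditions are ignored
```

Running such a probe once at startup, when the option is enabled, would let the metastore refuse to start in concurrent mode on a provider that silently ignores the condition.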

Describe alternatives you've considered

People can already use the Postgres metastore, or run a non-replicated metastore.

trinity-1686a avatar Jun 11 '25 09:06 trinity-1686a

I think @fmassot would love to see this happen.

I think there are different ways to make this work. One possibility: use a "broken" version of leader election, e.g. the metastore with the highest id is the leader. All writes go through this "broken" notion of a leader.

The leader writes using conditional writes to ensure its writes are consistent. Upon failure, it reloads the model. As for searchers: upon search, if they have neither written nor read the index in the last N seconds, they need to pull an up-to-date state.
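A minimal sketch of that optimistic write path, under assumptions: `VersionedObject` is a hypothetical in-memory stand-in for a single metastore file whose version tag plays the role an ETag with an `If-Match` condition would play on S3, and `Metastore.update` is an illustrative name, not Quickwit's API.

```python
import copy
import json


class VersionMismatch(Exception):
    """Raised when the stored version differs from the one the writer expected."""


class VersionedObject:
    """Hypothetical object with compare-and-swap on a version tag."""

    def __init__(self, state):
        self.version = 0
        self.body = json.dumps(state)

    def read(self):
        return self.version, json.loads(self.body)

    def write_if_version(self, expected_version, state):
        # Conditional write: rejected if someone else wrote in the meantime.
        if expected_version != self.version:
            raise VersionMismatch(expected_version)
        self.body = json.dumps(state)
        self.version += 1
        return self.version


class Metastore:
    def __init__(self, obj):
        self.obj = obj
        self.version, self.state = obj.read()

    def update(self, mutate, max_attempts=5):
        """Apply `mutate` to the cached model; on conflict, reload and retry."""
        for _ in range(max_attempts):
            candidate = mutate(copy.deepcopy(self.state))
            try:
                self.version = self.obj.write_if_version(self.version, candidate)
            except VersionMismatch:
                # Our model was stale: reload it, then re-apply the mutation.
                self.version, self.state = self.obj.read()
                continue
            self.state = candidate
            return candidate
        raise RuntimeError("too many conflicting writers")
```

With a single effective leader, the conflict branch should rarely run; it exists so that a mistaken second writer costs a retry rather than a lost update.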

TLDR:

  • optimistic strategy with conditional write brings correctness
  • shitty leader election brings performance
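The freshness rule for searchers mentioned above (refresh if the index was neither read nor written in the last N seconds) might look something like this. `CachedIndexState`, `load` and `max_staleness_secs` are hypothetical names chosen for illustration.

```python
import time


class CachedIndexState:
    """Keeps a local copy of the index state and reloads it once it is too old."""

    def __init__(self, load, max_staleness_secs, clock=time.monotonic):
        self._load = load                      # callable fetching the state from storage
        self._max_staleness = max_staleness_secs
        self._clock = clock                    # injectable so the rule can be tested
        self._state = None
        self._last_sync = None                 # time of the last read or write

    def get(self):
        now = self._clock()
        if self._last_sync is None or now - self._last_sync >= self._max_staleness:
            # Neither read nor written recently: pull an up-to-date state.
            self._state = self._load()
            self._last_sync = now
        return self._state

    def note_write(self, state):
        # A successful write means we hold the latest state by construction.
        self._state = state
        self._last_sync = self._clock()
```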

fulmicoton avatar Jun 12 '25 07:06 fulmicoton

What would be the projected performance difference? I would assume it's still possible to see occasional latency spikes and timeouts from S3, which could cause a lot of downstream problems.

But it sounds nicer than running HA Postgres.

esatterwhite avatar Jun 14 '25 23:06 esatterwhite

Are there any ongoing experiments related to this?

If not, I'd like to explore two options.

  1. Aurora DSQL backend: Aurora DSQL is a fully managed, Postgres-compatible serverless database. It's currently the easiest-to-use database option available on AWS. Since DSQL doesn't offer all Postgres features, some queries may need to be modified for compatibility. If this is feasible, it could significantly reduce operational costs for AWS users.

  2. SlateDB (on top of S3) backend: SlateDB is an S3-backed KV database. It optimizes reads and writes using LSM and disk cache, and already implements CAS-based writer fencing. SlateDB is a common Rust library, so it can be easily embedded.
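To illustrate what CAS-based writer fencing means in general, here is a generic sketch. It is not SlateDB's actual implementation or API: `Manifest`, `Writer` and the epoch scheme are assumptions made for the example, with each `Manifest` method standing in for one conditional write to object storage.

```python
class Fenced(Exception):
    """Raised when a writer has been superseded by a newer one."""


class Manifest:
    """Hypothetical shared record guarded by compare-and-swap on an epoch."""

    def __init__(self):
        self.epoch = 0
        self.entries = {}

    def claim(self, seen_epoch):
        # CAS: succeeds only if nobody bumped the epoch since it was read.
        if seen_epoch != self.epoch:
            raise Fenced(seen_epoch)
        self.epoch += 1
        return self.epoch

    def put(self, writer_epoch, key, value):
        # Writes from any writer holding an older epoch are rejected.
        if writer_epoch != self.epoch:
            raise Fenced(writer_epoch)
        self.entries[key] = value


class Writer:
    def __init__(self, manifest):
        self.manifest = manifest
        # Becoming the writer bumps the epoch, which fences off the previous one.
        self.epoch = manifest.claim(manifest.epoch)

    def put(self, key, value):
        self.manifest.put(self.epoch, key, value)
```

In this scheme a second `Writer` created on the same manifest takes over, and any later `put` from the first one raises `Fenced` instead of silently overwriting data.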

Perhaps it's possible to use the existing file-backed metastore on top of ZeroFS (which is backed by SlateDB), but I believe a dedicated key-value design is better for efficient reads and minimized writer contention.

cometkim avatar Oct 01 '25 04:10 cometkim