John Spray
Extend metrics, including at least: - HTTP API status counters + latency histograms for the controller's HTTP API - Latency/error counters tagged by node_id for outbound calls to pageserver -...
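A minimal sketch of the first item (status counters plus a latency histogram), using only the standard library. The `HttpMetrics` type, its method names, and the bucket bounds are assumptions for illustration; they are not the controller's actual metrics code, which would more likely use a metrics library.

```rust
use std::collections::BTreeMap;
use std::time::Duration;

/// Upper bounds (in seconds) of the latency buckets. Illustrative values only.
const BUCKETS_SECS: [f64; 5] = [0.005, 0.05, 0.5, 1.0, 5.0];

/// Hypothetical in-memory metrics for an HTTP API.
#[derive(Debug, Default)]
pub struct HttpMetrics {
    /// Number of responses seen per HTTP status code.
    status_counts: BTreeMap<u16, u64>,
    /// Cumulative buckets: entry `i` counts observations <= BUCKETS_SECS[i].
    latency_buckets: [u64; 5],
    /// Total number of observations (the implicit +Inf bucket).
    latency_count: u64,
    /// Sum of all observed latencies, in seconds.
    latency_sum_secs: f64,
}

impl HttpMetrics {
    /// Record one completed request.
    pub fn observe(&mut self, status: u16, latency: Duration) {
        *self.status_counts.entry(status).or_insert(0) += 1;

        let secs = latency.as_secs_f64();
        for (i, bound) in BUCKETS_SECS.iter().enumerate() {
            if secs <= *bound {
                self.latency_buckets[i] += 1;
            }
        }
        self.latency_count += 1;
        self.latency_sum_secs += secs;
    }

    /// Responses recorded for a given status code (0 if never seen).
    pub fn status_count(&self, status: u16) -> u64 {
        self.status_counts.get(&status).copied().unwrap_or(0)
    }

    /// Cumulative count for the bucket at `index`.
    pub fn bucket_count(&self, index: usize) -> u64 {
        self.latency_buckets[index]
    }

    /// Total number of recorded requests.
    pub fn total_count(&self) -> u64 {
        self.latency_count
    }
}
```

The buckets are cumulative, matching the convention used by Prometheus-style histograms, so a 3 ms request increments every bucket from 5 ms upward.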
Example: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6805/7957035402/index.html#suites/140824de6e814b5b1ae2b622c3f67840/6cd46f9911ed5b0f In that run, the compute hook (local version, using neon_local `Endpoint`) is hanging, causing migration to time out.
Currently, for tenant shards attached to a node whose availability state is set to offline, we demote it to a secondary in the IntentState, and schedule another node to be...
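The demotion step described above can be sketched roughly as follows. This is a simplified stand-in, not the storage controller's real type: the field names, `NodeId`, and `demote_attached` are assumptions, and picking the replacement node is left to the caller.

```rust
/// Hypothetical node identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeId(pub u64);

/// Simplified stand-in for a shard's intent: the node it should be attached
/// to, and the nodes that should hold secondary locations.
#[derive(Debug, Default)]
pub struct IntentState {
    pub attached: Option<NodeId>,
    pub secondary: Vec<NodeId>,
}

impl IntentState {
    /// If `offline` is the currently attached node, demote it to a secondary.
    ///
    /// Returns `true` when a demotion happened, which tells the caller that
    /// the shard now has no attached location and needs rescheduling.
    pub fn demote_attached(&mut self, offline: NodeId) -> bool {
        if self.attached != Some(offline) {
            return false;
        }
        self.attached = None;
        // Avoid listing the same node twice as a secondary.
        if !self.secondary.contains(&offline) {
            self.secondary.push(offline);
        }
        true
    }
}
```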
If we restart all our services at the same time, something awkward happens: - Pageservers depend on accessing the storage controller's `/upcall/v1/re-attach` endpoint to proceed with startup. - The storage...
This PR adds an async/await-compatible interface for librados's aio methods. A new Completion type wraps Ceph's completions in a `Future`, including the cancel-on-drop behaviour expected of Rust futures. For...
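To illustrate the cancel-on-drop pattern in general terms, here is a standard-library-only sketch. It does not call librados: the `Shared` type stands in for the backend side of a completion, the `i32` result is a placeholder, and cancellation is modelled as a flag where a real binding would ask the library to cancel the in-flight operation.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

/// State shared between the future and a hypothetical I/O backend.
#[derive(Default)]
pub struct Shared {
    result: Mutex<Option<i32>>,
    waker: Mutex<Option<Waker>>,
    cancelled: AtomicBool,
}

impl Shared {
    /// Called by the backend when the operation finishes.
    pub fn complete(&self, value: i32) {
        *self.result.lock().unwrap() = Some(value);
        if let Some(waker) = self.waker.lock().unwrap().take() {
            waker.wake();
        }
    }

    /// True once the future was dropped before completing.
    pub fn is_cancelled(&self) -> bool {
        self.cancelled.load(Ordering::SeqCst)
    }
}

/// Future that resolves when the backend calls `Shared::complete`.
pub struct Completion {
    shared: Arc<Shared>,
    done: bool,
}

impl Completion {
    /// Returns the future plus the handle the backend would hold.
    pub fn new() -> (Completion, Arc<Shared>) {
        let shared = Arc::new(Shared::default());
        let fut = Completion { shared: Arc::clone(&shared), done: false };
        (fut, shared)
    }
}

impl Future for Completion {
    type Output = i32;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<i32> {
        // Register the waker before checking the result, so a completion
        // that races with this poll still triggers a wake-up.
        *self.shared.waker.lock().unwrap() = Some(cx.waker().clone());
        let ready = self.shared.result.lock().unwrap().take();
        match ready {
            Some(value) => {
                self.done = true;
                Poll::Ready(value)
            }
            None => Poll::Pending,
        }
    }
}

impl Drop for Completion {
    fn drop(&mut self) {
        // Dropping an unfinished future signals cancellation to the backend.
        if !self.done {
            self.shared.cancelled.store(true, Ordering::SeqCst);
        }
    }
}
```

The important property is that the caller never has to cancel explicitly. Dropping the future (for example when a `select!` branch loses or a timeout fires) is enough for the backend to learn that the result is no longer wanted.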
There was a broken `resources` block (bad indentation), with tiny resource limits. This PR fixes the indentation and sets more modest limits (1 GB, 1 CPU core). This facilitates better alerting...
By design, Redpanda will sometimes leave orphan objects in its object storage bucket. This happens when a node writes a segment, but then unexpectedly loses leadership before it can update...
The per-timeline histogram-per-op-type of page_service latencies makes up the vast majority of the metrics output from pageservers, and is very rarely used. We already have node-wide versions of these stats....
A design for a cheap low-resource state for idle timelines: - #8088
The scrubber's scan_metadata command will flag some inconsistencies, but it's not quite robust enough to trust at scale: - It needs to handle objects being deleted while it scans, by...