John Spray

Results: 144 issues by John Spray

Extend metrics, including at least:
- HTTP API status counters + latency histograms for the controller's HTTP API
- Latency/error counters tagged by node_id for outbound calls to pageserver
- ...

t/feature
c/storage/controller

Example: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6805/7957035402/index.html#suites/140824de6e814b5b1ae2b622c3f67840/6cd46f9911ed5b0f

In that run, the compute hook (local version, using neon_local `Endpoint`) is hanging, causing migration to time out.

t/bug
a/test
c/storage

Currently, for tenant shards attached to a node whose availability state is set to offline, we demote that node to a secondary in the IntentState, and schedule another node to be...

t/feature
c/storage/controller

If we restart all our services at the same time, something awkward happens:
- Pageservers depend on accessing the storage controller's `/upcall/v1/re-attach` endpoint to proceed with startup.
- The storage...

a/tech_debt
c/storage/controller

This PR adds an async/await-compatible interface for librados's aio methods. A new `Completion` type wraps Ceph's completions into a `Future`, including the cancel-on-drop behaviour expected of Rust futures. For...

There was a broken `resources` block (bad indentation), with tiny resource limits. This PR fixes the indentation and sets more modest limits (1Gb, 1 CPU core). This facilitates better alerting...

By design, Redpanda will sometimes leave orphan objects in its object storage bucket. This happens when a node writes a segment, but then unexpectedly loses leadership before it can update...

kind/enhance
area/cloud-storage

The per-timeline histogram-per-op-type of page_service latencies makes up the vast majority of the metrics output from pageservers, and is very rarely used. We already have node-wide versions of these stats....

c/storage/pageserver
a/tech_debt

A design for a cheap low-resource state for idle timelines:
- #8088

## Checklist before requesting a review
- [ ] I have performed a self-review of my code.
- ...

c/storage/pageserver
t/tech_design_rfc

The scrubber's `scan_metadata` command will flag some inconsistencies, but it's not quite robust enough to trust at scale:
- It needs to handle objects being deleted while it scans, by...

t/feature
c/storage/scrubber