John Spray issues

Results: 144 issues of John Spray

storage controller: prometheus metrics & dashboard

Extend metrics, including at least:
- HTTP API status counters + latency histograms for the controller's HTTP API
- Latency/error counters tagged by node_id for outbound calls to pageserver
- ...

t/feature

c/storage/controller

Flakiness in test_sharding_split_smoke

Example: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-6805/7957035402/index.html#suites/140824de6e814b5b1ae2b622c3f67840/6cd46f9911ed5b0f

In that run, the compute hook (local version, using neon_local `Endpoint`) is hanging, causing migration to time out.

t/bug

a/test

c/storage

storage controller: graceful handling of attempts to reconcile with offline nodes

Currently, for tenant shards attached to a node whose availability state is set to offline, we demote it to a secondary in the IntentState, and schedule another node to be...

t/feature

c/storage/controller

controller: graceful behavior for restarting all services at the same time

If we restart all our services at the same time, something awkward happens:
- Pageservers depend on accessing the storage controller's `/upcall/v1/re-attach` endpoint to proceed with startup.
- The storage...

a/tech_debt

c/storage/controller

async/await compatible wrapper for librados AIO methods

This PR adds an async/await compatible interface for librados's aio methods. A new Completion type wraps Ceph's completions into a `Future`, including the cancel-on-drop behaviour expected of rust futures. For...

controller: add default resource limits

There was a broken `resources` block (bad indentation), with tiny resource limits. This PR fixes the indentation and sets more modest limits (1Gb, 1 CPU core). This facilitates better alerting...

cloud_storage: bucket scrub

By design, Redpanda will sometimes leave orphan objects in its object storage bucket. This happens when a node writes a segment, but then unexpectedly loses leadership before it can update...

kind/enhance

area/cloud-storage

pageserver: reduce per-timeline histogram metrics

The per-timeline histogram-per-op-type of page_service latencies makes up the vast majority of the metrics output from pageservers, and is very rarely used. We already have node-wide versions of these stats....

c/storage/pageserver

a/tech_debt

rfcs: add RFC for timeline archival

A design for a cheap low-resource state for idle timelines: #8088

c/storage/pageserver

t/tech_design_rfc

scrubber: more robust metadata consistency check

The scrubber's scan_metadata command will flag some inconsistencies, but it's not quite robust enough to trust at scale:
- It needs to handle objects being deleted while it scans, by...

t/feature

c/storage/scrubber