materialize
storage: maintain source and sink statistics through restarts
This PR changes source and sink statistics so that counters/gauges are not reset when mz/replicas/sources/sinks are restarted. The motivation behind this is two-fold:
- It clarifies the behavior of statistics during edge cases, which increases our confidence that the new source progress statistics we are adding make sense
- It WILDLY simplifies the handling of statistics in the console/by users (particularly when calculating rates)
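To illustrate the second point: if a counter never resets, a rate is just the difference between two samples divided by the elapsed time, with no special handling for restarts. The sketch below assumes a hypothetical `Sample` type and `rate_per_sec` helper; neither is part of this PR or of the console code.

```rust
/// One observation of a cumulative counter (for example, messages received).
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy)]
struct Sample {
    /// Seconds since some fixed epoch.
    at_secs: f64,
    /// Cumulative counter value at that time.
    value: u64,
}

/// Average rate between two samples, assuming the counter is never reset.
/// Returns `None` if the samples are out of order or the counter went backwards.
fn rate_per_sec(earlier: Sample, later: Sample) -> Option<f64> {
    let dt = later.at_secs - earlier.at_secs;
    if dt <= 0.0 {
        return None;
    }
    // With counters maintained through restarts this subtraction should not
    // underflow; `checked_sub` keeps the sketch safe if it ever does.
    let delta = later.value.checked_sub(earlier.value)?;
    Some(delta as f64 / dt)
}
```

Without the guarantee, a caller would need to detect each reset and stitch the segments together before dividing.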
At its core, it's not a huge change, but in practice it requires a lot of adjustments and code movement to get right. This PR has many commits, but most are straightforward (tests/code movement/small changes). The two main commits are:
`_per_worker` -> `_raw`
The former changes the `_statistics_per_worker` collections to `_statistics_raw`. Because we now need to maintain counters and gauges through changes in the source, and replicas can change their number of workers, it's no longer reasonably feasible to hold onto per-worker statistics. We lose history whenever we change this schema, so I want to do this only once, which is why I added the new statistics columns despite them not yet being hooked up. This is the PR to bikeshed the naming and schema of these columns, so we avoid churn in the future.
Note also that we rely on this schema change to be able to safely unpack statistics on envd restart.
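A rough sketch of why a single row per source is easier to carry across restarts than per-worker rows: the aggregated row has the same shape no matter how many workers reported. The types and field names below (`WorkerStats`, `RawStats`, `aggregate`) are hypothetical, and summing is only an assumption about how counters combine; the actual schema is the one defined in this PR.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-worker report; not the actual Materialize type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct WorkerStats {
    messages_received: u64,
    bytes_received: u64,
}

/// Hypothetical aggregated row: one per source, independent of worker count.
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
struct RawStats {
    messages_received: u64,
    bytes_received: u64,
}

/// Sums `(source_id, worker_id, stats)` reports into one row per source id.
fn aggregate(reports: &[(u64, usize, WorkerStats)]) -> BTreeMap<u64, RawStats> {
    let mut rows: BTreeMap<u64, RawStats> = BTreeMap::new();
    for &(source_id, _worker_id, stats) in reports {
        // The worker id is deliberately dropped: the result must not depend
        // on how many workers the replica happens to have.
        let row = rows.entry(source_id).or_default();
        row.messages_received += stats.messages_received;
        row.bytes_received += stats.bytes_received;
    }
    rows
}
```

If a replica goes from two workers to four, the output of `aggregate` still has exactly one row per source, so the stored value can be read back after a restart without knowing the old worker count.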
The core changes
This commit contains the core changes, all working in concert:
- Read and unpack the existing current values for statistics from the collection when bootstrapping envd.
- Enforce the resetting behavior of different counters and gauges with type-level wrappers in `mz_storage_client::statistics`.
- Build on https://github.com/MaterializeInc/materialize/pull/25135 to aggregate the per-worker data in clusterd into a single update per source/sink to communicate to the controller (this simplifies the controller code).
- A large large large amount of mechanical changes in support of the above 3 changes. I did my best to make it comprehensible.
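The idea of type-level wrappers can be sketched as follows: each kind of statistic gets its own type, and the type decides what happens to a previously stored value when envd bootstraps. The names `Counter`, `ResettingGauge`, and `bootstrap` are hypothetical and chosen for illustration; the real wrappers in `mz_storage_client::statistics` may be named and shaped differently.

```rust
/// A counter that keeps accumulating across restarts.
/// Hypothetical wrapper, not the actual type from this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Counter(u64);

impl Counter {
    /// Starts from the value last written to the collection, or 0 if none.
    fn bootstrap(previous: Option<u64>) -> Self {
        Counter(previous.unwrap_or(0))
    }

    /// Adds a delta reported by the cluster, saturating instead of wrapping.
    fn add(&mut self, delta: u64) {
        self.0 = self.0.saturating_add(delta);
    }
}

/// A gauge that is deliberately cleared on restart until a fresh value arrives.
/// Hypothetical wrapper, not the actual type from this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ResettingGauge(Option<u64>);

impl ResettingGauge {
    /// Ignores any previously stored value: a stale gauge reading is treated
    /// as worse than no reading.
    fn bootstrap(_previous: Option<u64>) -> Self {
        ResettingGauge(None)
    }

    /// Records the latest observed value.
    fn set(&mut self, value: u64) {
        self.0 = Some(value);
    }
}
```

Because the reset policy lives in the type, code that handles a statistic cannot accidentally apply the wrong policy: a `Counter` has no way to be cleared, and a `ResettingGauge` has no way to inherit an old value.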
Motivation
- This PR adds a known-desirable feature.
- This PR refactors existing code.
Tips for reviewer
see above
Checklist
- [x] This PR has adequate test coverage / QA involvement has been duly considered.
- [x] This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design. https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/20240108_source_metrics_2.md
- [x] If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
- [ ] This PR includes the following user-facing behavior changes:
Mitigations
Completing required mitigations increases Resilience Coverage.
- [x] (Required) Code Review 🔍 Detected
- [ ] (Required) Feature Flag
- [x] (Required) Integration Test 🔍 Detected
- [ ] (Required) Observability
- [x] (Required) QA Review 🔍 Detected
- [ ] (Required) Run Nightly Tests
- [ ] Unit Test
Risk Summary:
The pull request poses a high risk with a score of 83, influenced by factors such as the average line count and executable lines within files. It's noteworthy that historically, pull requests with these characteristics are 140% more likely to introduce bugs compared to the repository baseline. Additionally, the changes involve 2 files that have recently seen a high number of bug fixes, further contributing to the risk. The repository's observed bug trend is currently increasing, although this is not directly tied to the risk score.
Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.
Bug Hotspots:
File | Percentile |
---|---|
../session/vars.rs | 94 |
../src/lib.rs | 98 |
I triggered a nightly and coverage build; I will report the results when they are ready.
Nightly looks good. I will annotate coverage results inline.
@nrainer-materialize some things won't be covered until the new metrics are actually hooked up; they are all just 0/NULL now!
Rest of nightlies look good!!
@petrosagg this is rebased on the pr that did the code movement; it also has new commits that change all the names to the ones decided on in: https://materializeinc.slack.com/archives/C0637RN7PKQ/p1707975829845539?thread_ts=1707956060.532079&cid=C0637RN7PKQ
It should be reviewable as 1 big pr, after familiarizing yourself with the main changes that are being made