Compute operator hydration status logging

teskje opened this issue 1 year ago • 8 comments

This PR makes the compute layer expose operator-level hydration status in two new SQL relations:

  • mz_compute_operator_hydration_status_per_worker
  • mz_compute_operator_hydration_status

(The latter being an aggregation of the former.)

To this end, a hydration logging operator is rendered after each LIR node in the plan; it reports hydration events to the ComputeState. The ComputeState forwards these events to the compute controller through a new Status compute response type. Finally, the controller writes the hydration events into a storage-managed collection.
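
For illustration, the new relation can be inspected with a plain query once the PR lands. This is a sketch only; the column names used here (object_id, physical_plan_node_id, replica_id, hydrated) are assumed and should be checked against the actual relation definition:

    -- Sketch: list operators that have not yet hydrated, assuming the relation
    -- exposes object_id, physical_plan_node_id, replica_id, and hydrated columns.
    SELECT object_id, physical_plan_node_id, replica_id
    FROM mz_internal.mz_compute_operator_hydration_statuses
    WHERE NOT hydrated;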

See the commit messages for details.

This is part of the Operator Hydration Status Tracking design (https://github.com/MaterializeInc/materialize/pull/24265).

Motivation

  • This PR adds a known-desirable feature.

Part of #23859

Checklist

  • [x] This PR has adequate test coverage / QA involvement has been duly considered.
  • [x] This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • [x] If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • [x] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • [x] This PR includes the following user-facing behavior changes:
    • Add a new relation, mz_internal.mz_compute_operator_hydration_statuses, that exposes compute hydration status at the operator level.

teskje avatar Feb 02 '24 13:02 teskje

I triggered a nightly build: Do we expect any performance impact from this PR? Benchmarks and the scalability framework are failing...

nrainer-materialize avatar Feb 05 '24 14:02 nrainer-materialize

Some performance impact is to be expected, yes. In TPC-H it was 3.5% on average, but with high variability, so some benchmarks might see way more than that. The wallclock failures are definitely something to look into.

I'm not so sure about the messages failures. What kind of messages are they counting?

teskje avatar Feb 05 '24 14:02 teskje

the sum of all messages in the system from mz_internal.mz_message_counts_per_worker

The mz_message_counts view describes the messages and message batches sent and received over the [dataflow] channels in the system.
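
For reference, that count corresponds roughly to a query like the sketch below; the sent column name is assumed from the relation's documentation and may differ:

    -- Sketch: total messages sent across all dataflow channels and workers,
    -- assuming the relation exposes a per-channel, per-worker `sent` column.
    SELECT sum(sent) AS total_messages_sent
    FROM mz_internal.mz_message_counts_per_worker;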

def- avatar Feb 05 '24 14:02 def-

In that case, any messages benchmark failures are expected: This PR adds a new operator for each LIR-level node, through which the LIR node's outputs flow. That implies one or two new channels and therefore some number of new messages.

teskje avatar Feb 05 '24 14:02 teskje

The wallclock failures are definitely something to look into.

I re-ran the nightlies and the only failures now are the expected messages ones. I assume the wallclock failure came from comparing against a version that differed by changes other than the ones in this PR, so it was a false positive.

teskje avatar Feb 12 '24 13:02 teskje

Risk Score: 81 / 100 • Bug Hotspots: 4 • Resilience Coverage: 100%

Mitigations

Completing required mitigations increases Resilience Coverage.

  • [x] (Required) Code Review 🔍 Detected
  • [x] (Required) Feature Flag
  • [x] (Required) Integration Test 🔍 Detected
  • [x] (Required) Observability 🔍 Detected
  • [x] (Required) QA Review 🔍 Detected
  • [x] (Required) Run Nightly Tests
  • [ ] Unit Test
Risk Summary:

The risk score for this pull request is high at 81, indicating a significant chance of introducing a bug. This assessment is based on predictors such as the sum of bug reports of files affected and the change in executable lines of code. Historically, pull requests with similar characteristics are 117% more likely to cause a bug compared to the repository's baseline. Additionally, the pull request modifies 4 files that have been recently identified as hotspots for bugs. The repository's observed bug trend is currently steady.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.

Bug Hotspots

File Percentile
../controller/instance.rs 99
../session/vars.rs 97
../src/lib.rs 98
../src/controller.rs 97

shepherdlybot[bot] avatar Feb 12 '24 13:02 shepherdlybot[bot]

@antiguru I decided to leave things as they are for now. We discussed:

  • Not logging hydration status for operators inside WMR regions. It occurred to me that having only a subset of all LIR nodes show up in mz_compute_operator_hydration_status would make for confusing UX. Also, there isn't much benefit in skipping WMR operators: doing so would make the code more complicated, and we are currently not concerned about the performance of WMR dataflows.
  • Using probes instead of a dedicated operator. Unfortunately, that's annoying to implement because probe::Handle<T> is generic over the timestamp type, which can vary. Placing these probes in a map would require, I think, a new trait we can store trait objects of. The probe approach is also less efficient: on every worker step we would iterate through all held probes to compare their frontiers, whereas with the dedicated logging operators we only need to compare frontiers when they advance. If we are worried about the logging operator leaving performance on the table, we can convert it to one that uses the raw OperatorBuilder, but looking at the feature benchmarks, that doesn't seem necessary for now.

teskje avatar Feb 12 '24 13:02 teskje

From discussion with @antiguru:

  • We should have a feature flag to disable rendering of logging operators in a pinch. (Added in the last commit; a hypothetical example of toggling such a flag is sketched after this list.)
  • Use a more specific name than node_id in contexts where it isn't obvious. For example: lir_id. (Done)
  • Rename NodeId to LirId. Too big of a change for this PR, but can be done as a follow-up.
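
For illustration only, toggling such a flag from SQL might look like the sketch below. The variable name here is hypothetical and not taken from this PR; the actual flag name and mechanism are whatever the last commit added:

    -- Hypothetical sketch: disable rendering of the hydration logging operators.
    -- The flag name is illustrative only, not the one introduced by this PR.
    ALTER SYSTEM SET enable_compute_operator_hydration_status_logging = false;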

teskje avatar Feb 12 '24 16:02 teskje

The tests failing now are messages feature benchmarks, where regressions are expected due to the new operators.

TFTRs!

teskje avatar Feb 19 '24 16:02 teskje