materialize
materialize copied to clipboard
adapter: add a SessionMeta struct for off-thread work
An upcoming refactor will move the ctx ownership during peeks out of the individual stages, making it not possible to share the session across tasks. The optimizer was designed to work across tasks, but we were breaking that abstraction boundary by passing sessions by reference instead of some object fully owned by the task.
Teach sessions how to generate a snapshot of their needed data for fn eval_unmaterializable_func called a SessionMeta, a reference-less struct containing mostly cheap data copies. It does have one exception to thas, which is that Vars must allocate a new btree and value strings because Var is not Clone (and is hard to make so), so isn't easily shareable. Overall we believe this var snapshotting, even though it is done on every request, is still cheaper than any other option we considered.
Abstract this new struct and the old Session with a common trait and use that in places where it's supported. Due to vars being tricky some things do require a SessionMeta, but we hope an upcoming Var refactor will allow this restriction to be lifted.
Our goal is to make the coord thread is un-contested as possible, and thus move work into off-thread tasks. If we did not make the current change, we would have to do something like split up the optimize stage into two stages, and in between them walk the expr and apply eval_unmaterializable_func to it on the main coord thread (since it owns the Session). This seems like it'd be at least as much CPU time as it takes to allocate a new btree and some Strings.
Other options we considered and rejected were:
- share the session with an
Arc(bad because during ctx.retire and other operations we'd have to wait an unknown amount of time to be the only strong reference holder and confuses ownership responsibilities) - use some immutable map data structure for session vars that has cheap snapshotting/cloning (not possible because values need to be
Clone, andVaris not) - change the sequence_staged API to pass around ctx by owner instead of reference (bad because its current API allows for regular
?returns, thus not needing to wrap errors inreturn_if_err!, and has a single owner of ctx with really clear responsibilities about which component needs to retire it)
Motivation
- This PR refactors existing code.
Checklist
- [ ] This PR has adequate test coverage / QA involvement has been duly considered.
- [ ] This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
- [ ] If this PR evolves an existing
$T ⇔ Proto$Tmapping (possibly in a backwards-incompatible way), then it is tagged with aT-protolabel. - [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
- [ ] This PR includes the following user-facing behavior changes:
- n/a
Mitigations
Completing required mitigations increases Resilience Coverage.
- [x] (Required) Code Review
🔍 Detected - [ ] (Required) Feature Flag
- [ ] (Required) Integration Test
- [ ] (Required) Observability
- [x] (Required) QA Review
🔍 Detected - [ ] (Required) Run Nightly Tests
- [ ] Unit Test
Risk Summary:
The pull request carries a high risk score of 81, indicating a significant likelihood of introducing a bug, with historical data showing that similar PRs are 114% more likely to cause a bug compared to the repository baseline. This risk is driven by predictors such as the sum of bug reports of the files affected and the change in executable lines. Additionally, 10 files modified in this PR are known hotspots for recent bug fixes, which further elevates the risk. The repository's predicted and observed bug trends both show a decrease, which is a positive sign but independent of the risk score for this specific pull request.
Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.
Bug Hotspots: What's This?
I triggered a nightly build and a coverage build; I will report the results when they are ready. 😀
The nightlies are reporting query latency increases which feels probably correct. We're working on a way to not do allocations by allowing Var to exist in immutable data structures, which may fix that problem.
Another idea: do some preprocessing (in one of the off-thread stages) to detect if a query has any unmat fns in itself or dependent views and only populate the session meta if that is true. The detection and application of this could both be done off coord main thread. We could even do a thing where in the "no unmat fns" case, we can then also skip the prep_expr_scalar step because we know it'll do nothing.
I triggered a nightly build and a coverage build; I will report the results when they are ready. 😀
As already noted by @mjibson, the feature benchmark and scalability framework reported regressions. It will be great if you can fix this problem. If this is not possible and we need to accept some performance deterioration, please reach out to me so that I can add appropriate ignore entries.
This is now erroring with
error[E0599]: no method named `iter` found for struct `Box<dyn Iterator<Item = (usize, usize, usize)>>` in the current scope
--> src/cluster/src/server.rs:203:22
|
202 | let layers = layers
| ______________________________-
203 | | .iter()
| | -^^^^ help: there is a method with a similar name: `filter`
| |_____________________|
and I'm very confused because I have changed what I thought were no files even close to that file and Box<dyn Iterator>...uh does not have iter()? Is it because it's already an iterator?
Additional commit with more efficient session vars. Retriggered benchmarks: https://buildkite.com/materialize/nightlies/builds/6730
Regarding coverage: all of the comments above appear to already have a test that covers them. Does the coverage testing include both testdrive and slt?
Yes, assuming the tests run through successfully.
Benchmarks passed (no failing regressions). RFAL with the new session stuff now.