storage: real-time recency MVP
MVP for real-time recency. Once this merges, it should be appropriate for private preview.
Motivation
This PR adds a known-desirable feature. #16060
Checklist
- [x] This PR has adequate test coverage / QA involvement has been duly considered.
- [ ] This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
- [ ] If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the `release-blocker` label (example).
- [ ] This PR includes the following user-facing behavior changes:
@MaterializeInc/testing Could use y'all's expertise in developing a testing suite against this prototype of real-time recency. The idea behind RTR is that when you issue a RTR query, we wait until we've ingested all data that is visible in the external system at the "same" physical time that the query gets issued.
Ideally, we would:
- Produce Kafka data
- Introduce a delay in ingesting it, simulating ingestion lag (but not a transient network error from the broker)
- Issue a non-RTR query and show the data is not present
- Issue a RTR query and show that the data is present.
The delay we introduce would need to be long enough that we can convince ourselves that we're not just "getting lucky."
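The semantics the plan above wants to assert can be sketched in pure Python (hypothetical names, stand-ins for the broker and for Materialize's ingestion; this is not the real implementation): a non-RTR query returns only what has been ingested so far, while a RTR query snapshots the upstream frontier at query time and blocks until ingestion catches up to it.

```python
import threading
import time

class SimulatedSource:
    """Stand-in for a Kafka source whose ingestion lags the upstream broker."""

    def __init__(self, ingestion_lag_s):
        self.ingestion_lag_s = ingestion_lag_s
        self.upstream = []   # data visible in the external system
        self.ingested = []   # data "Materialize" has ingested so far
        self.lock = threading.Lock()

    def produce(self, value):
        """Write to the upstream system; ingestion completes only after the lag."""
        with self.lock:
            self.upstream.append(value)

        def ingest():
            time.sleep(self.ingestion_lag_s)
            with self.lock:
                self.ingested.append(value)

        threading.Thread(target=ingest, daemon=True).start()

    def query(self, real_time_recency=False):
        if not real_time_recency:
            # Non-RTR: answer from whatever has been ingested so far.
            with self.lock:
                return list(self.ingested)
        # RTR: snapshot the frontier visible upstream at query time, then
        # wait until ingestion has caught up to that frontier.
        with self.lock:
            frontier = len(self.upstream)
        while True:
            with self.lock:
                if len(self.ingested) >= frontier:
                    return self.ingested[:frontier]
            time.sleep(0.01)

source = SimulatedSource(ingestion_lag_s=0.5)
source.produce("row-1")
stale = source.query(real_time_recency=False)  # issued before the lag elapses
fresh = source.query(real_time_recency=True)   # blocks until "row-1" lands
```

The 0.5 s lag here plays the role of the "long enough to not get lucky" delay: the non-RTR query observes nothing, the RTR query observes the row.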
Which testing framework would you recommend for this? My guess is instrumenting the Kafka source with a failpoint to introduce the latency and then everything else could be done in testdrive.
One small wrinkle with that approach is that I would ultimately like to query multiple Kafka sources, each with its own distinct latency, and getting that wired up with failpoints seems fussy, though maybe it's doable?
For the ingestion lag, Toxiproxy is good; we already use it in some tests, and you can make ingestion as slow as you want. Testdrive seems fine for the queries. I think the kafka-resumption tests in test/kafka-resumption/mzcompose.py come relatively close to this.
> One small wrinkle with that approach is that I would ultimately like to query multiple Kafka sources, each with its own distinct latency, and getting that wired up with failpoints seems fussy, though maybe it's doable?
This is doable with Toxiproxy, you can have multiple Kafka sources, all with their own delays, see for example workflow_sink_networking.
If this is too cumbersome, I can also write a test for you.
@def- This is in a shape that we can begin testing it. The RTR operation itself is expensive, so opening 100 (1000? 10?) clients doing simultaneous RTR queries might fall over in some unexpected way. The most meaningful tests will ensure we uphold our semantics, but performance is a secondary concern.
I'm going to work on seeing if there's a way to make this the default for testing queries from Kafka sources but that seems a little tricky.
@nrainer-materialize Here's the real-time recency work we chatted about expanding the tests of. I believe @def- had some outstanding tests, but am not sure exactly what he'd planned or had in mind.
Thanks! I will have a look and work on it next week.
@rjobanp Do you have bandwidth/interest in adding a MySQL real-time recency test akin to the Kafka and PG ones? I'm sure you're much defter than I with their dialect of SQL, so thought I'd ask. nbd if you're bandwidth constrained.
@sploiselle based on the pg-rtr test it should be almost identical for you to add a mysql one -- the only change is that `generate_series` doesn't exist, but you can use the `@i:=@i+1` pattern we have here: https://github.com/rjobanp/materialize/blob/766595589acdf619512ac8057b5df5cfb7c40187/test/mysql-cdc/mzcompose.py#L164
Oh, I do get some errors for https://github.com/MaterializeInc/materialize/pull/25566 in https://buildkite.com/materialize/nightly/builds/7683. Though, I haven't had the time to take a deeper look yet.
Alright. Regarding testing:
1) RTR enabled everywhere
https://github.com/MaterializeInc/materialize/pull/25463, originally by Dennis, enables RTR in all CI. This PR is a one-shot test and not supposed to be merged.
Tests pipeline
- MySQL CDC and Postgres CDC fail due to a result mismatch. The test receives more values than expected. Both failures persist after a retry. => Fixed with https://github.com/MaterializeInc/materialize/pull/27063.
- For this test run, I increased the timeout of different tests, including testdrive tests. Nonetheless, 5 of the 8 sharded runs time out. (Original timeout: 40 minutes per sharded instance, average run time: about 10-13 minutes per sharded instance, adjusted timeout: 80 minutes per sharded instance) => There is a major performance impact!
- https://github.com/MaterializeInc/materialize/pull/25566/commits/f0e295a25044eb66e10279b19ab07c0ae7ce957b will fix one testdrive failure.
Nightly
- The feature benchmark fails in scenario StartupLoaded, which is three times slower. However, I am actually surprised that we don't see more performance regressions here, given the testdrive build timeouts in the tests pipeline.
- Testdrive builds time out here as well, although I had increased timeouts for nightly too.
- Four parallel workload tests fail due to `timed out before ingesting the source's visible frontier when real-time-recency query issued`.
2) Additional tests
https://github.com/MaterializeInc/materialize/pull/25566, originally by Dennis, adds additional RTR tests and extends existing test frameworks (data-ingest, parallel-workload). That branch is based on the one of this PR. All additional commits of that branch should be cherry-picked to this branch (except for https://github.com/MaterializeInc/materialize/pull/25566/commits/719ac3096a3582cbec08caf6f78f9e2f93875ffb, which was cherry-picked from main and could cause conflicts).
Added tests in tests pipeline
Of the added tests, some do not pass but presumably should. They are in the resumption workflow in kafka-rtr, manipulate the connection, fail due to `BrokerTransportFailure (Local: Broker transport failure)`, and were disabled with https://github.com/MaterializeInc/materialize/pull/25566/commits/80714258569f1e3c120a3646da62b3b63026ba1f. Consequently, the build step with the remaining workflow passes.
Nightly
The nightly pipeline (nearly) passes after adjusting some timeouts.
- Original build: https://buildkite.com/materialize/nightly/builds/7683
- Follow-up 1: https://buildkite.com/materialize/nightly/builds/7702 (the parallel-workload failure was seen and fixed on main; it should be gone after a rebase)
- Follow-up 2: https://buildkite.com/materialize/nightly/builds/7705
Summary
To sum up, there still seem to be some issues:
- The feature comes with a major performance impact.
- Postgres-CDC and MySQL-CDC results appear to no longer be correct when using this feature (an issue in the tests, fixed with https://github.com/MaterializeInc/materialize/pull/27063).
- When run in concurrency (parallel-workload), the problem `timed out before ingesting the source's visible frontier when real-time-recency query issued` may occur.
- Resumption tests with this feature were not successful.
Let me know if you have any questions!
@nrainer-materialize Thank you SO much for the detailed reporting here. The nightly timeouts make sense to me, but I'm pretty spooked by the result mismatches.
@sploiselle, please let me know if you need further support or want me to retest something. Thank you.
@nrainer-materialize Will do! I'm making my way through the feedback you provided and so far all of the issues have been either issues of scale or subtle issues with the tests themselves. Planning a detailed accounting of what's to be done, but I'm feeling good that this prototype is in the right shape.
Mitigations
Completing required mitigations increases Resilience Coverage.
- [x] (Required) Code Review 🔍 Detected
- [ ] (Required) Feature Flag
- [x] (Required) Integration Test 🔍 Detected
- [x] (Required) Observability 🔍 Detected
- [x] (Required) QA Review 🔍 Detected
- [ ] (Required) Run Nightly Tests
- [ ] Unit Test
Risk Summary:
The risk score for this pull request is high at 83, indicating a significant likelihood of introducing a bug. This assessment is driven by predictors such as the average line count and the number of executable lines within files. There are 8 modified files that are known hotspots for bugs, which increases the risk. Historically, pull requests with similar characteristics are 124% more likely to cause a bug compared to the repository's baseline. While the repository's predicted bug trend is decreasing, which is a positive sign, the observed bug trend remains steady.
Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.
Bug Hotspots: What's This?
| File | Percentile |
|---|---|
| ../inner/peek.rs | 94 |
| ../src/lib.rs | 91 |
| ../session/vars.rs | 97 |
| ../src/coord.rs | 99 |
| ../sources/mysql.rs | 95 |
| ../sequencer/inner.rs | 98 |
| ../coord/message_handler.rs | 96 |
| ../src/lib.rs | 98 |
> The feature comes with a major performance impact.
The performance impact mentioned here is largely (entirely?) related to latency. We expect the latency of RTR queries to exceed those of non-RTR queries commensurate with the speed at which we ingest data and propagate the ingestion through the entire system.
The prototype itself doesn't introduce much undue latency to the queries (i.e., envd cannot determine that a value has been ingested much more quickly than by listening to the remap shard).
> When run in concurrency (parallel-workload), the problem `timed out before ingesting the source's visible frontier when real-time-recency query issued` may occur.
The scalability of the prototype is not great! We open a separate connection to the upstream object for each query. I expect the number of concurrent queries we support to be quite low, and this is something that we can both get customer feedback on, as well as easily improve. (e.g. an easy win is to only allow one outstanding RTR query per connection object, and we can stash all queries waiting on that object in a queue. Once the current query returns, we can issue another RTR timestamp fetch, and the timestamp returned is valid for all queued queries).
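The "easy win" described above can be sketched as a coalescing timestamp oracle (hypothetical names, pure-Python stand-in, not the actual envd code): only one upstream timestamp fetch is outstanding per connection object, and every query that arrives while it is in flight waits in a queue and reuses the returned timestamp.

```python
import threading
import time

class _Fetch:
    """One in-flight upstream timestamp fetch; waiters block on `done`."""
    def __init__(self):
        self.done = threading.Event()
        self.result = None

class RtrTimestampOracle:
    """Coalesces concurrent RTR timestamp fetches for one connection object:
    the first caller performs the upstream round trip, and callers arriving
    while it is outstanding share its result (a later frontier is always a
    valid RTR timestamp for queries issued earlier)."""

    def __init__(self, fetch_upstream_frontier):
        self._fetch = fetch_upstream_frontier
        self._lock = threading.Lock()
        self._pending = None   # the in-flight _Fetch, if any
        self.round_trips = 0   # instrumentation: upstream round trips made

    def get_timestamp(self):
        with self._lock:
            if self._pending is not None:
                fetch, leader = self._pending, False  # join the queue
            else:
                fetch = self._pending = _Fetch()
                leader = True
        if leader:
            fetch.result = self._fetch()  # the only upstream round trip
            with self._lock:
                self.round_trips += 1
                self._pending = None      # next arrival starts a new fetch
            fetch.done.set()
            return fetch.result
        fetch.done.wait()
        return fetch.result

# Ten "simultaneous" RTR queries against one slow upstream (0.5 s per fetch):
oracle = RtrTimestampOracle(lambda: (time.sleep(0.5), 1234)[1])
results = []
threads = [threading.Thread(target=lambda: results.append(oracle.get_timestamp()))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All ten queries complete with one shared timestamp and one round trip.
```

Without coalescing, the same ten queries would cost ten upstream connections, which is the scalability cliff described above.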
> Resumption tests with this feature were not successful.
The resumption tests are, unfortunately, pathologically structured and can never succeed. One of the tenets of the current prototype is that we use the same connection object as the source itself (i.e., it must have the same parameters, etc.). The current design of the resumption tests introduces a partition (of sorts) between envd and the source's upstream object, and then issues a RTR query. Unfortunately, this partition not only affects the ingestion, but also impedes our ability to determine the RTR timestamp (which is why all of the queries fail).
A trickier design here would be to issue the RTR query, then introduce the fault, and then fix it. The timing of all of this seems very racy, though, so I'm not sure exactly how we'd want to instrument it.
@MaterializeInc/storage This is ready for a code review. Note that this doesn't include a mysql-rtr test, though it probably should. QA tested that the existing tests do pass with RTR enabled and this is only moving to PrPr, so I didn't sweat it too profusely.
I can add some if we consider that important!
> @MaterializeInc/storage This is ready for a code review. Note that this doesn't include a `mysql-rtr` test, though it probably should. QA tested that the existing tests do pass with RTR enabled and this is only moving to PrPr, so I didn't sweat it too profusely.
@sploiselle: I added mysql-rtr tests with https://github.com/MaterializeInc/materialize/pull/25195/commits/06b63ed44c1b6412b3a7a0362d8d76f42c8e7d05.
@nrainer-materialize tysm!
@petrosagg Addressed all of your feedback modulo erroring if you have RTR enabled and query a load generator source. I think the ergonomics of that are less friendly than they could be (we let you query everything else that doesn't actually support RTR; making this an exception feels odd to me). I'd prefer to educate users about those semantics in documentation.
Also gave up on semantic commits and everything is just in one commit now. Apologies if that introduces any overhead for you.