[Epic] Surfacing Source/Sink Errors
Initiative and Theme
Materialize is Friendly; Materialize makes it easy to understand what is/isn't working
Problem
Subset of https://github.com/MaterializeInc/materialize/issues/7115 relevant for M2. STORAGE needs to surface source/sink errors in Platform. For example, if a user currently enters the wrong password for a Kafka source, the source still appears healthy from the user's perspective in Platform.
Success Criteria
Source/sink errors are presented clearly to the end user.
QA Testing and Sign Off
- [ ] Make sure comprehensive tests are present for all possible source and sink errors
- [ ] Add source and sink errors to the Platform Checks framework (to confirm proper behavior across upgrades and restarts)
- [ ] Go through all the known previously-filed GH tickets and make sure that each error is surfaced properly
Time Horizon
Large
Blockers
None
Blocks
P1 deliverable of https://github.com/MaterializeInc/cloud/issues/3259
Previously https://github.com/MaterializeInc/materialize/issues/7115
We need to audit which source errors make it into the Err collection, and eventually into the persist shard. As of now, I think too many errors end up there that shouldn't. They should instead go to the new status shard that @andrioni introduced.
10,000 ft summary: Only definite errors should make it into the Err collection. System/runtime errors, which are transient, should not.
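To make the split concrete, here is a minimal Rust sketch of the routing rule described above. The `SourceErrorKind` enum and `route_error` function are hypothetical names for illustration, not the actual Materialize types:

```rust
/// Hypothetical classification of source errors, illustrating the split
/// described above. Names are illustrative, not real Materialize types.
enum SourceErrorKind {
    /// Definite errors: deterministic facts about the data itself
    /// (e.g. a row that fails to decode). These belong in the Err
    /// collection and, eventually, the persist shard.
    Definite,
    /// Transient system/runtime errors (e.g. a broker connection reset).
    /// These may resolve on retry and are not facts about the data, so
    /// they should be reported via the status shard instead.
    Transient,
}

fn route_error(kind: SourceErrorKind) {
    match kind {
        SourceErrorKind::Definite => { /* write to the Err collection */ }
        SourceErrorKind::Transient => { /* record in the status shard */ }
    }
}
```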
This internal message (and thread) is important for things we need to think about when we pick the work on this back up: https://materializeinc.slack.com/archives/C01CFKM1QRF/p1660726837927649.
We need to figure out:
- who writes to the status shard, and who advances its upper when nothing happens?
- how does the storage controller learn about the upper of the status shard?
Potential solutions from the thread:
- the storage controller or the `storaged` instances make sure the upper advances
- we change the storage controller to actively listen for upper changes of all shards/collections it knows about, instead of the controller relying on `StorageResponse::Uppers` and `append()` advancing the upper
Plus, we need to rename this type of collection (that `mz_source_status_history` is) from "storage collection", because all collections that storage knows about are really storage collections.
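As a rough illustration of the first option, here is a sketch of a background task that keeps a status shard's upper advancing even when no status updates arrive, so downstream readers can make progress. The `StatusShard` trait is a hypothetical stand-in, not the real persist API:

```rust
use std::time::Duration;

/// Hypothetical handle to a status shard; NOT the real persist API.
trait StatusShard {
    /// The current upper frontier of the shard.
    fn upper(&self) -> u64;
    /// Append an empty batch to advance the upper to `new_upper`.
    fn advance_upper(&mut self, new_upper: u64);
}

/// One possible shape for "the controller (or storaged) makes sure the
/// upper advances": a periodic task that bumps the upper toward "now"
/// whenever no status writes have done so already.
async fn keep_upper_advancing<S: StatusShard>(shard: &mut S, now: impl Fn() -> u64) {
    let mut ticker = tokio::time::interval(Duration::from_secs(1));
    loop {
        ticker.tick().await;
        let target = now();
        if shard.upper() < target {
            // No status updates landed in this interval; advance anyway.
            shard.advance_upper(target);
        }
    }
}
```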
Linking an issue we came across in external testing that we should surface.
Below is a collection of problems that users have encountered in the wild.
Sources:
- [x] Using the wrong `SASL_MECHANISMS` option causes the source to complain under the hood (see https://github.com/MaterializeInc/materialize/issues/12864#issuecomment-1222544893).
- [ ] Flagging during source creation if an upstream database has a setting that doesn't allow the WAL held back by a slot to exceed a certain size (reference).
- [x] "Hidden" connection string issues and network timeouts (reference, another reference).
- [x] Kafka read errors. For example, when using Redpanda, we need `WRITE` privileges to perform ingestion in addition to `READ`, `DESCRIBE`, and `DESCRIBE CONF` (reference).
- [x] `Topic #0/1: vori_prod.public.vendor_products with 0 partitions: Broker: Unknown topic or partition` (reference)
Sinks:
- [ ] Sinks stop emitting without any errors in the logs (reference).
Semi-related: https://github.com/MaterializeInc/materialize/issues/15892
Linking in https://github.com/MaterializeInc/materialize/issues/15949 - which is triggered by the increased write contention on the status shards. It's a blocker for taking this feature out of unsafe mode.
Went through @nmeagan11's comment here and checked through the linked issues. Most of them are in principle fixed by the implementation currently in main. (In the sense that, given our best understanding of how the customer's issue came about, we'd have something helpful for them in the new status tables.) A couple are not:
- The postgres configuration check should in principle be a warning at creation / purification time (see the sketch below).
- For sinks, we still ignore any errors in the upstream collection. We really ought to do something with those errors, but we're not sure what: none of the behaviours we can come up with for this seem like the obvious right thing for users.
I don't think either of those are blockers for this epic, though I'll confirm that they're tracked somewhere if we don't get to them before we close this out.
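For reference, the postgres configuration check mentioned above could plausibly look something like the following purification-time sketch. It assumes `tokio-postgres` and the Postgres 13+ `max_slot_wal_keep_size` setting; the warning plumbing is hypothetical, not Materialize's actual purification code:

```rust
use tokio_postgres::Client;

/// Sketch of the purification-time warning suggested above: if the upstream
/// database caps how much WAL a replication slot may retain, a slow or
/// paused source could have its slot invalidated, so we warn up front.
/// The warning plumbing here is illustrative only.
async fn check_wal_retention(client: &Client) -> Result<Option<String>, tokio_postgres::Error> {
    // `max_slot_wal_keep_size` (Postgres 13+): -1 means slots may retain
    // unlimited WAL; any other value is a hard cap.
    let row = client.query_one("SHOW max_slot_wal_keep_size", &[]).await?;
    let value: String = row.get(0);
    if value != "-1" {
        return Ok(Some(format!(
            "upstream max_slot_wal_keep_size is {value}; the replication slot \
             may be invalidated if the source falls behind"
        )));
    }
    Ok(None)
}
```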
The required work on this epic is considered done. 🎉 @philip-stoev has worked with @bkirwi on tests and signed off on this. We will treat additional error scenarios as future work and track separately.