[Epic] Surfacing Source/Sink Errors
Initiative and Theme
Materialize is Friendly; Materialize makes it easy to understand what is/isn't working
Problem
Subset of https://github.com/MaterializeInc/materialize/issues/7115 relevant for M2. STORAGE needs to surface source/sink errors in Platform. For example, if a user currently enters the wrong password for a Kafka source, the source still appears healthy from the user's perspective in Platform.
Success Criteria
Source/sink errors are presented clearly to the end user.
QA Testing and Sign Off
- [ ] Make sure comprehensive tests are present for all possible source and sink errors
- [ ] Add source and sink errors to the Platform Checks framework (to confirm proper behavior across upgrades and restarts)
- [ ] Go through all the known previously-filed GH tickets and make sure that each error is surfaced properly
Time Horizon
Large
Blockers
None
Blocks
P1 deliverable of https://github.com/MaterializeInc/cloud/issues/3259
Previously https://github.com/MaterializeInc/materialize/issues/7115
We need to audit which source errors make it into the Err collection, and eventually into the persist shard. As of now, I think too many errors end up there that shouldn't. They should instead go to the new status shard that @andrioni introduced.
10,000 ft summary: Only definite errors should make it into the Err collection. System/runtime errors, which are transient, should not.
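To make the split concrete, here is a minimal Rust sketch of the routing rule described above. The `SourceErrorKind` enum and `route_error` function are hypothetical names for illustration, not the actual Materialize types:

```rust
/// Hypothetical classification of source errors, illustrating the split
/// described above. Names are illustrative, not real Materialize types.
enum SourceErrorKind {
    /// Definite errors: deterministic facts about the data itself
    /// (e.g. a row that fails to decode). These belong in the Err
    /// collection and, eventually, the persist shard.
    Definite,
    /// Transient system/runtime errors (e.g. a broker connection reset).
    /// These may resolve on retry and are not facts about the data, so
    /// they should be reported via the status shard instead.
    Transient,
}

fn route_error(kind: SourceErrorKind) {
    match kind {
        SourceErrorKind::Definite => { /* write to the Err collection */ }
        SourceErrorKind::Transient => { /* record in the status shard */ }
    }
}
```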
This internal message (and thread) is important for things we need to think about when we pick the work on this back up: https://materializeinc.slack.com/archives/C01CFKM1QRF/p1660726837927649.
We need to figure out:
- who writes to the status shard, and who advances its upper when nothing happens?
- how does the storage controller learn about the upper of the status shard?
Potential solutions from the thread:
- the storage controller or the `storaged` instances make sure the upper advances
- we change the storage controller to actively listen for upper changes of all shards/collections it knows about, instead of the controller relying on `StorageResponse::Uppers` and `append()` advancing the upper
Plus, we need to rename this type of collection (that `mz_source_status_history` is) from "storage collection", because all collections that storage knows about are really storage collections.
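As a rough illustration of the first option, here is a sketch of a background task that keeps a status shard's upper advancing even when no status updates arrive, so downstream readers can make progress. The `StatusShard` trait is a hypothetical stand-in, not the real persist API:

```rust
use std::time::Duration;

/// Hypothetical handle to a status shard; NOT the real persist API.
trait StatusShard {
    /// The current upper frontier of the shard.
    fn upper(&self) -> u64;
    /// Append an empty batch to advance the upper to `new_upper`.
    fn advance_upper(&mut self, new_upper: u64);
}

/// One possible shape for "the controller (or storaged) makes sure the
/// upper advances": a periodic task that bumps the upper toward "now"
/// whenever no status writes have done so already.
async fn keep_upper_advancing<S: StatusShard>(shard: &mut S, now: impl Fn() -> u64) {
    let mut ticker = tokio::time::interval(Duration::from_secs(1));
    loop {
        ticker.tick().await;
        let target = now();
        if shard.upper() < target {
            // No status updates landed in this interval; advance anyway.
            shard.advance_upper(target);
        }
    }
}
```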
Linking an issue we came across in external testing that we should surface.
Below is a collection of problems that users have encountered in the wild.
Sources:
- [x] Using the wrong `SASL_MECHANISMS` option causes the source to complain under the hood (see https://github.com/MaterializeInc/materialize/issues/12864#issuecomment-1222544893).
- [ ] Flagging during source creation if an upstream database has a setting that doesn't allow the WAL held back by a slot to exceed a certain size (reference).
- [x] "Hidden" connection string issues and network timeouts (reference, another reference).
- [x] Kafka read errors. For example, when using Redpanda, we need `WRITE` privileges to perform ingestion in addition to `READ`, `DESCRIBE`, and `DESCRIBE CONF` (reference).
- [x] `Topic #0/1: vori_prod.public.vendor_products with 0 partitions: Broker: Unknown topic or partition` (reference)
Sinks:
- [ ] Sinks stop emitting without any errors in the logs (reference).
Semi-related: https://github.com/MaterializeInc/materialize/issues/15892
Linking in https://github.com/MaterializeInc/materialize/issues/15949 - which is triggered by the increased write contention on the status shards. It's a blocker for taking this feature out of unsafe mode.
Went through @nmeagan11's comment here and checked through the linked issues. Most of them are in principle fixed by the implementation currently in main. (In the sense that, given our best understanding of how the customer's issue came about, we'd have something helpful for them in the new status tables.) A couple are not:
- The postgres configuration check should in principle be a warning at creation / purification time (see the sketch below).
- For sinks, we still ignore any errors in the upstream collection. We really ought to do something with those errors, but we're not sure what: none of the behaviours we can come up with for this seem like the obvious right thing for users.
I don't think either of those are blockers for this epic, though I'll confirm that they're tracked somewhere if we don't get to them before we close this out.
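For reference, the postgres configuration check mentioned above could plausibly look something like the following purification-time sketch. It assumes `tokio-postgres` and the Postgres 13+ `max_slot_wal_keep_size` setting; the warning plumbing is hypothetical, not Materialize's actual purification code:

```rust
use tokio_postgres::Client;

/// Sketch of the purification-time warning suggested above: if the upstream
/// database caps how much WAL a replication slot may retain, a slow or
/// paused source could have its slot invalidated, so we warn up front.
/// The warning plumbing here is illustrative only.
async fn check_wal_retention(client: &Client) -> Result<Option<String>, tokio_postgres::Error> {
    // `max_slot_wal_keep_size` (Postgres 13+): -1 means slots may retain
    // unlimited WAL; any other value is a hard cap.
    let row = client.query_one("SHOW max_slot_wal_keep_size", &[]).await?;
    let value: String = row.get(0);
    if value != "-1" {
        return Ok(Some(format!(
            "upstream max_slot_wal_keep_size is {value}; the replication slot \
             may be invalidated if the source falls behind"
        )));
    }
    Ok(None)
}
```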
The required work on this epic is considered done. 🎉 @philip-stoev has worked with @bkirwi on tests and signed off on this. We will treat additional error scenarios as future work and track separately.