[epic] sql+storage: `CONNECTION`s supported + required
Folks agreed we should do this work, so turning this into a tracking ticket.
## Systems
- [x] Kafka (#11504)
- [x] Confluent schema registry (#11504)
- [x] Postgres (#13552)
- [x] SSH (#14293)
- [x] AWS, i.e. S3 + Kinesis (#13665)
- [x] PubNub (#14294)
## Requires connection

- [ ] Kafka
  - [ ] Sources (#14295)
  - [x] Sinks (#14214)
- [x] Confluent schema registry
  - [x] Sources (#14296)
  - [x] Sinks (#14297)
- [x] Postgres sources
- [x] SSH sources (#14293)
- [x] AWS, i.e. S3 + Kinesis sources (#13665)
- [x] PubNub sources (#14294)
## Feature request

For background on `CONNECTION`s, see the design doc.

We have the option to require references to `CONNECTION` objects to connect to
external systems (e.g. Kafka, Postgres), which means removing the ability to
express the connection inline.
### Ad hoc syntax

```sql
> CREATE MATERIALIZED SOURCE data
  FROM KAFKA BROKER 'kafka:9092' TOPIC 'testdrive-data-${testdrive.seed}'
  WITH (
      security_protocol = 'SSL',
      ssl_key_pem = SECRET ssl_key_kafka,
      ssl_key_password = SECRET ssl_key_password_kafka,
      ssl_certificate_pem = '${arg.materialized-kafka-crt}',
      ssl_ca_pem = '${arg.ca-crt}'
  )
  FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY '${testdrive.schema-registry-url}'
  WITH (
      ssl_key_pem = SECRET ssl_key_csr,
      ssl_key_password = SECRET ssl_key_password_csr,
      ssl_certificate_pem = '${arg.materialized-schema-registry-crt}',
      ssl_ca_pem = '${arg.ca-crt}',
      username = 'materialize',
      password = SECRET password_csr
  )
  ENVELOPE DEBEZIUM
```
### Connector syntax

```sql
> CREATE CONNECTION kafka_ssl
  FOR KAFKA
      BROKER 'kafka:9092',
      SSL KEY = SECRET ssl_key_kafka,
      SSL KEY PASSWORD = SECRET ssl_key_password_kafka,
      SSL CERTIFICATE = '${arg.materialized-kafka-crt}',
      SSL CERTIFICATE AUTHORITY = '${arg.ca-crt}';

> CREATE CONNECTION csr_ssl
  FOR CONFLUENT SCHEMA REGISTRY
      URL '${testdrive.schema-registry-url}',
      SSL KEY = SECRET ssl_key_csr,
      SSL KEY PASSWORD = SECRET ssl_key_password_csr,
      SSL CERTIFICATE = '${arg.materialized-schema-registry-crt}',
      SSL CERTIFICATE AUTHORITY = '${arg.ca-crt}',
      USERNAME = 'materialize',
      PASSWORD = SECRET password_csr;

> CREATE MATERIALIZED SOURCE connector_source
  FROM KAFKA CONNECTION kafka_ssl
  TOPIC 'testdrive-data-${testdrive.seed}'
  FORMAT AVRO
  USING CONFLUENT SCHEMA REGISTRY CONNECTION csr_ssl
  ENVELOPE DEBEZIUM
```
You can also see this in situ in the kafka-ssl test.

Concretely, the proposal is to disallow the "Ad hoc" syntax entirely and support only the "Connector syntax." If this proposal is not accepted or implemented, we will continue to accept either syntax.
## Context

- We are in the process of converting all `WITH` options to rely on enum-based
  names, rather than arbitrary strings, as a means of providing more consistent
  code paths in the planner. This means breaking lots of backward compatibility
  with e.g. source definitions, so now is a great time (and the only time) to
  make this shift.
- The current process for describing options for sources is complex: e.g., all
  of the security options are placed alongside other options, despite the fact
  that they're most likely the same for all connections to the broker. Reducing
  the number of valid `WITH` options in any spot simplifies the UX of creating
  sources.
- The SQL + Storage + DevEx teams have all agreed on the approach for
  `CONNECTION` objects. However, if we permit either `CONNECTION` references or
  ad hoc option definitions, we will provide two means to accomplish the same
  goal. It seems ideal to provide only one means of accomplishing this task.
- Removing the ability to define connection parameters in the ad hoc options
  will let us simplify code paths, e.g. we will not need to add
  `TryFrom<GeneralSourceOptions> for SourceConnectionOptions`.
The downside to this proposal is the same as its upside: reduced flexibility in creating resources.
## Work entailed

After creating `CONNECTION` objects for all external systems (e.g. Postgres,
AWS), which is work we are doing anyway, the remaining work is removing the
inline definition code paths and amending tests to use only `CONNECTION`s.
I'm a +1 for disallowing the "ad hoc" syntax entirely. It will be simpler for the user to have just one way of doing something. Here's an internal thread with more discussion.
@heeringa asks:

> For developers that will use this programmatically via an API, does it make sense to have a single command version?
We could add an inline definition of the connection when creating a source:

```sql
CREATE MATERIALIZED SOURCE connector_source
  FROM KAFKA CONNECTION kafka_ssl (
      BROKER 'kafka:9092',
      SSL KEY = SECRET ssl_key_kafka,
      SSL KEY PASSWORD = SECRET ssl_key_password_kafka,
      SSL CERTIFICATE = '${arg.materialized-kafka-crt}',
      SSL CERTIFICATE AUTHORITY = '${arg.ca-crt}'
  )
  TOPIC 'testdrive-data-${testdrive.seed}'
  FORMAT AVRO
  USING CONFLUENT SCHEMA REGISTRY CONNECTION csr_ssl
  ENVELOPE DEBEZIUM
```
If we see that users are vexed by the separate DDL commands (one for the connection, one for the resource), we could add an option like this in the future.
Thanks for providing the examples. Hopefully this set of DDL statements is mostly run by admins only once, upon account creation, so adding steps here might not add many total extra steps to a user's workflow. But here are some concerns and ideas:
**In the path to first successful usage**

This is in the path to the "AHA" moment. Imagine writing a quick start tutorial, or sitting over someone's shoulder walking them through creating their first materialized view, before and after this change.

**Ideas to address:** We've talked about giving people code snippets or templates that write 90% of the SQL for common connection scenarios. That might help.
**Is the troubleshooting experience getting better or worse?**

We have seen in the community that real-world Kafka connections, with various configurations of certs, SASL, and the Schema Registry, are often the first place users get stuck. If there were a way to provide immediate feedback on errors upon executing a `CREATE CONNECTION` statement, troubleshooting might get easier! If not, I'm worried it's getting more difficult.

Will you be able to run `DROP CONNECTION kafka CASCADE;` and have it drop all sources that use it? Would it also drop all views and indexes that use the source?
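For concreteness, here is a sketch of the dependency chain the question is about. All object names are illustrative, and whether `CASCADE` transitively drops dependent views and indexes is exactly the open question:

```sql
-- Illustrative only: a connection, a source built on it, and a view on the source.
CREATE CONNECTION kafka_conn FOR KAFKA BROKER 'kafka:9092';

CREATE MATERIALIZED SOURCE data
  FROM KAFKA CONNECTION kafka_conn TOPIC 'orders'
  FORMAT BYTES;

CREATE MATERIALIZED VIEW order_counts AS SELECT count(*) FROM data;

-- The question: does this also drop `data`, and transitively `order_counts`?
DROP CONNECTION kafka_conn CASCADE;
```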
Also +1 on dropping support for the inline syntax (as discussed in the thread Nicolle linked). I'm so excited about this separation of concerns!
> This is in the path to the "AHA" moment - Imagine writing a quick start tutorial or sitting over someone's shoulder walking them through creation of their first materialized view before and after this change.
I think this makes demos and quickstarts cleaner, since you can configure a connection once and spare users from writing a bunch of verbose syntax, or from copy-pasting the connection configuration across multiple source (and sink) definitions. It also makes it easier to document in a way that provides more detail than what we currently have on each connector page, which should help cover common mishaps users run into.
> If there was a way to provide immediate feedback of errors upon executing a `CREATE CONNECTION` statement, troubleshooting might get easier! If not, I'm worried it's getting more difficult.
Great point! It'd be neat to be able to "Test connection" like in other databases, if we're making this a separate concern, instead of relying on having to create a source to debug connection issues.
> It'd be neat to be able to "Test connection" like in other databases
I would go even further and setup a continuous ping (e.g. once every 30s) for each connection. Since connections are top-level objects, the UX can then provide an overview page with some status field indicating the health of each connection. This could be a great way to troubleshoot sources & sink issues that are due to connectivity issues (authentication, network failure) rather than something more specific (misconfigured shard, bad topic name, a server running in read-only mode, etc.).
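As a sketch of what these ideas might look like at the SQL level (this syntax is entirely hypothetical and not part of the proposal, including the statement and relation names):

```sql
-- Hypothetical: attempt the handshake eagerly, surfacing auth/TLS/network
-- errors without having to create a source first.
VALIDATE CONNECTION kafka_ssl;

-- Hypothetical: a system relation populated by a periodic health check,
-- which a UX overview page could query for per-connection status.
SELECT name, status, last_checked FROM mz_connection_statuses;
```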
I'm satisfied on the syntax question and convinced that having one way to do this is the right choice.
I think we can close this issue, right?
@andrioni Not yet--we haven't removed the ability to e.g. specify Kafka connections inline. You can find some more details from @benesch here https://github.com/MaterializeInc/cloud/issues/2837#issuecomment-1186250863
Ah, sorry, I thought they were all mandatory already! Thanks!
They're mandatory when running with `--unsafe-mode`, which was the quick fix I came up with so we didn't need to rewrite all the tests at once!
@sploiselle did this land?
@JLDLaughlin No, not yet. There is a lot of work (i.e. typing) remaining to make this happen. I'll spin up some issues to atomize this and make it a little easier to follow along.
@JLDLaughlin @chaas I updated this to make surveying the remaining work a little easier. Modulo SSH connections, the remaining work is not particularly hard, but it is a phenomenal amount of typing.
cc @uce @nmeagan11, as I created some tracking issues for the storage team as well.
This is great, @sploiselle! Thank you!
@benesch, @sploiselle : Once @jkosh44 merges his fix to #14295, is this epic complete? Some of the checkboxes aren't marked, but it looks like every issue will be completed.
@chaas yes
Amazing 🎉!! Thank you @sploiselle. @jkosh44 you can mark this epic closed once you merge your PR.
Huge thank you to @sploiselle and @jkosh44 for slogging through this thankless work!