
[Epic] Use Kafka Formats in Platform

Open nmeagan11 opened this issue 2 years ago • 15 comments

Initiative and Theme

Materialize is Friendly; Materialize works with your existing pipelines

Problem

Some users do not use the format we are supporting in M1 (FORMAT BYTES). We need to bring back the formats from pre-Platform to make our Kafka sources usable for more people. Order of priority:

  • Avro (P1)
  • JSON (any more work to do here?) (P1)
  • Text (P1)
  • CSV (P2)
  • Protobuf (P3)

Success Criteria

Users can successfully execute a CREATE SOURCE ... FROM KAFKA statement using the same formats they were able to use pre-Platform.
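
As a rough illustration of this criterion, the sketch below builds one CREATE SOURCE ... FROM KAFKA statement per format in the priority list. The helper name `create_source_sql`, the connection names `kafka_conn` and `csr_conn`, and the exact FORMAT clauses are assumptions for illustration, not syntax confirmed in this issue. CREATE SOURCE syntax has changed between Materialize releases, so check the docs for the version in use.

```python
# Hypothetical FORMAT clauses keyed by the format names in the priority list.
# The clause text is an assumption; verify it against the Materialize docs.
FORMAT_CLAUSES = {
    "avro": "FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY CONNECTION csr_conn",
    "json": "FORMAT JSON",
    "text": "FORMAT TEXT",
    "csv": "FORMAT CSV WITH 3 COLUMNS",
    "protobuf": "FORMAT PROTOBUF USING CONFLUENT SCHEMA REGISTRY CONNECTION csr_conn",
}


def create_source_sql(fmt, topic, source_name=None, kafka_conn="kafka_conn"):
    """Return a CREATE SOURCE statement for one format.

    This is a sketch: identifiers and the topic are interpolated without
    escaping, so only pass trusted values.
    """
    clause = FORMAT_CLAUSES[fmt.lower()]  # raises KeyError for unknown formats
    name = source_name or f"{fmt.lower()}_source"
    return (
        f"CREATE SOURCE {name} FROM KAFKA CONNECTION {kafka_conn} "
        f"(TOPIC '{topic}') {clause};"
    )


if __name__ == "__main__":
    # Print one statement per format, in the order listed above.
    for fmt in FORMAT_CLAUSES:
        print(create_source_sql(fmt, topic="events"))
```

Running the statements against a test environment, one per format, would give a simple pass/fail matrix for the criterion.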

Tasks

  • [x] Test formats (Avro, JSON, Text, CSV, Protobuf)
    • [x] Coordinate with DevEx team about existing tests (see also: https://github.com/MaterializeInc/developer-experience/issues/166)
    • [x] Document results
  • [x] Persist upstream schema from Confluent Schema Registry (maybe)
    • [x] Decide whether we want to do this

QA Sign-off

  • [x] Make sure all formats are represented in testdrive tests and the Platform Checks framework
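
One way to sanity-check that coverage is to scan the test files for `FORMAT <name>` clauses and report any format that never appears. The sketch below is only an assumption about how such a check could look: the function name `missing_formats` and the idea of passing test-file contents as strings are hypothetical, and a text match does not prove a test actually exercises the format. JSON data ingested through another format clause (for example FORMAT BYTES) would not be detected.

```python
import re

# Format names taken from the priority list in this issue.
REQUIRED_FORMATS = {"AVRO", "JSON", "TEXT", "CSV", "PROTOBUF"}

# Matches "FORMAT <WORD>", case-insensitively, and captures the word.
FORMAT_RE = re.compile(r"\bFORMAT\s+([A-Z]+)", flags=re.IGNORECASE)


def missing_formats(test_texts):
    """Return the required formats that appear in none of the given texts."""
    seen = set()
    for text in test_texts:
        for match in FORMAT_RE.finditer(text):
            seen.add(match.group(1).upper())
    return REQUIRED_FORMATS - seen


if __name__ == "__main__":
    # Hypothetical test-file contents; real input would be read from disk.
    samples = [
        "CREATE SOURCE a FROM KAFKA ... FORMAT AVRO USING ...",
        "CREATE SOURCE b FROM KAFKA ... FORMAT TEXT",
    ]
    print(sorted(missing_formats(samples)))
```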

Time Horizon

Small

Blockers

None

nmeagan11 avatar Jun 02 '22 20:06 nmeagan11

Time Horizon

6 weeks

I think we're basically getting these "for free"! Check with @elindsey and @petrosagg to be sure, but I don't think there's any additional work to do here.

benesch avatar Jun 12 '22 21:06 benesch

I think we're basically getting these "for free"!

I'll leave this one to @elindsey as he mentioned to me some bits that need to be done.

nmeagan11 avatar Jun 13 '22 15:06 nmeagan11

Bump on this one! @elindsey or @petrosagg—what remains to be done here?

benesch avatar Jun 27 '22 01:06 benesch

I picked Eli's brain on Slack. The outstanding work items are:

  • Testing that the formats work in Materialize Cloud.
  • Determining whether we need to persist the schemas we read from the Confluent Schema Registry, or whether we're comfortable relying on the upstream registry in perpetuity.

It also occurs to me that the FROM SCHEMA FILE option for Avro/Protobuf formats needs to be removed, because there is no way to upload a schema file in platform. That makes our tests tricky, though...

benesch avatar Jun 27 '22 20:06 benesch

@nmeagan11 can you coordinate with @bobbyiliev to test the above formats in Materialize Cloud and document (or link the results here)? We need a way to track these with confidence as we move into Previews.

Can we also create or link issues for (1) deciding whether we need to persist schemas and (2) deciding how to handle FROM SCHEMA FILE in the future?

heeringa avatar Aug 07 '22 17:08 heeringa

@heeringa Updated the original issue description to reflect the open tasks here.

uce avatar Aug 08 '22 11:08 uce

@nmeagan11 can you coordinate with @bobbyiliev to test the above formats in Materialize Cloud and document (or link the results here)? We need a way to track these with confidence as we move into Previews.

@heeringa, please see the linked devex issue.

nmeagan11 avatar Aug 08 '22 15:08 nmeagan11

@benesch, can you confirm the status of FROM SCHEMA FILE? Was it removed? @uce, @aljoscha, and I weren't sure!

nmeagan11 avatar Aug 08 '22 15:08 nmeagan11

@benesch, can you confirm the status of FROM SCHEMA FILE? Was it removed? @uce, @aljoscha, and I weren't sure!

It's gated behind unsafe mode, so we're good for the purposes of GA. It's tech debt we need to clean up at some point though—a bunch of our internal tests still rely on the feature! Details in #13703.

benesch avatar Aug 10 '22 05:08 benesch

Marking the "Test formats" task as complete after conversation with @bobbyiliev that everything is working as expected.

nmeagan11 avatar Aug 19 '22 15:08 nmeagan11

Marking "Persist upstream schema from Confluent Schema Registry (maybe)" as complete since we decided not to do it (reference).

The remaining items are QA sign-off (cc @philip-stoev) and the tech debt cleanup of FROM SCHEMA FILE.

nmeagan11 avatar Aug 22 '22 18:08 nmeagan11

@nmeagan11 how are you thinking about persisting schemas for the future? Icebox and re-evaluate at every planning cycle based on demand? Something else?

heeringa avatar Aug 24 '22 20:08 heeringa

Icebox and re-evaluate at every planning cycle based on demand?

Exactly!

nmeagan11 avatar Aug 24 '22 21:08 nmeagan11

I don't think we'll ever need to persist schema information to support CREATE SOURCE ... FORMAT AVRO. We added the work item back in the day before the architecture of platform was as fleshed out. After the linked Slack conversation, I'm pretty convinced what we're doing is safe.

There is, however, a desire to support a standalone avro_decode function, which would require a standalone CSR source. That's a bit pie in the sky still, but it is tracked in #14133.

benesch avatar Aug 25 '22 05:08 benesch

I created a separate tracking issue for the FROM SCHEMA FILE cleanup (https://github.com/MaterializeInc/materialize/issues/14911) and I removed https://github.com/MaterializeInc/materialize/issues/12304#issuecomment-1134457022 as a blocker to this epic because it's not a priority for our current milestone. Now that all tasks are complete and we have QA sign-off, I think we're OK to close this epic as complete (@uce to confirm)!

nmeagan11 avatar Sep 21 '22 16:09 nmeagan11