
Apache Kafka integration

Open polya20 opened this issue 4 years ago • 8 comments

Any ideas how to integrate with Apache Kafka topics?

polya20 avatar Aug 16 '19 23:08 polya20

@polya20 we haven't added support for Kafka topics yet. But from my understanding of Kafka, if you have a schema registry deployed on your end, you could pull topic metadata from there and persist it in Amundsen. You would also need to build UI support for that.

feng-tao avatar Aug 17 '19 01:08 feng-tao

Not sure if schema-registry will be a valid source, as in recent versions schemas are no longer uniquely linked to one topic: https://docs.confluent.io/current/schema-registry/serializer-formatter.html#sr-avro-subject-name-strategy.
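To make the "not uniquely linked" point concrete, here is a minimal sketch of how the three Confluent subject name strategies derive a registry subject. The function names are illustrative (the real logic lives in the Java serializers): only the default topic strategy lets you recover the topic from the subject.

```python
# Sketch of Confluent Schema Registry subject naming strategies.
# Function names are illustrative, not the real serializer API.

def topic_name_strategy(topic: str, record_name: str, is_key: bool = False) -> str:
    # Default strategy: one subject per topic, so subject -> topic is unambiguous.
    return f"{topic}-{'key' if is_key else 'value'}"

def record_name_strategy(topic: str, record_name: str, is_key: bool = False) -> str:
    # Subject depends only on the record type: many topics can share one subject,
    # so the topic can no longer be recovered from the subject alone.
    return record_name

def topic_record_name_strategy(topic: str, record_name: str, is_key: bool = False) -> str:
    # Middle ground: subject is scoped to both topic and record type.
    return f"{topic}-{record_name}"

print(topic_name_strategy("orders", "com.example.Order"))         # orders-value
print(record_name_strategy("orders", "com.example.Order"))        # com.example.Order
print(topic_record_name_strategy("orders", "com.example.Order"))  # orders-com.example.Order
```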

I'd prefer a kafka-topic-metadata-extractor that uses the Admin client to gather topic metadata.

Would this work with the current data model, mapping a Kafka topic to a table? Or is stream entity support coming in later versions?
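The extractor idea above could take roughly this shape. This is a hypothetical sketch: in a real databuilder job the input would come from something like `confluent_kafka.admin.AdminClient.list_topics()`, which returns cluster metadata with a topic-to-partitions mapping; here a plain dict stands in for that so the transformation logic is self-contained.

```python
# Hypothetical kafka-topic-metadata-extractor core. The {topic: partition_count}
# mapping stands in for what the Kafka Admin client would return.
from typing import Dict, List

def extract_topic_metadata(topics: Dict[str, int], cluster: str) -> List[dict]:
    """Turn a {topic_name: partition_count} mapping into flat metadata records."""
    return [
        {"cluster": cluster, "topic": name, "partitions": count}
        for name, count in sorted(topics.items())
    ]

records = extract_topic_metadata({"orders": 12, "clicks": 3}, cluster="prod")
# Two records, sorted by topic name: clicks first, then orders.
```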

jeqo avatar Dec 11 '19 15:12 jeqo

@jeqo I think it needs a new model for Kafka topics.

feng-tao avatar Dec 12 '19 18:12 feng-tao

What would the Kafka topic model look like?

Here is a list of metadata that pops into my head.

  • topics
  • schemas
  • offset info
  • partitions
  • consumer & producer info
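The metadata list above could be sketched as a model along these lines. All field names are illustrative, not an agreed Amundsen entity definition:

```python
# One possible shape for a Kafka topic entity covering the metadata listed
# above. Field names are illustrative only.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class KafkaTopic:
    cluster: str
    name: str
    partitions: int
    schema: Optional[str] = None                            # e.g. latest schema as a JSON string
    offsets: Dict[int, int] = field(default_factory=dict)   # partition -> high watermark
    consumers: List[str] = field(default_factory=list)      # consumer group ids
    producers: List[str] = field(default_factory=list)      # producing services

topic = KafkaTopic(cluster="prod", name="orders", partitions=12)
```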

And here are some ideas for the frontend and databuilder.

Frontend

  • Instead of the date range used for Hive partitions, maybe the frontend can show an offset range and such for Kafka topics.
  • Replace table users with Kafka consumers and table owners with producers?

Databuilder

Databuilder will periodically gather metadata via the Kafka Admin client. If we don't use schema-registry, a topic's schema will have to be inferred from the message at the latest offset. If the data is structured with Protobuf or Avro, inferring the schema won't be too hard. But if some Kafka topics use unstructured data like text or CSV, we might have to pull schemas from schema-registry.
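As a toy illustration of the "infer from the latest message" idea, here is a sketch that derives a rough field-to-type map from a single JSON payload (assumed format; Avro/Protobuf topics would instead carry a real schema or a registry schema id, which is far more reliable than sampling):

```python
import json

def infer_schema_from_message(payload: bytes) -> dict:
    """Very rough field -> Python type-name map inferred from one JSON message.

    Sampling one message can miss optional fields and nullable types; a schema
    registry is the reliable source when one is available."""
    record = json.loads(payload)
    return {key: type(value).__name__ for key, value in record.items()}

schema = infer_schema_from_message(b'{"order_id": 42, "amount": 9.99, "user": "alice"}')
# {'order_id': 'int', 'amount': 'float', 'user': 'str'}
```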

jacobhjkim avatar Jul 13 '20 04:07 jacobhjkim

@jacobhjkim are you suggesting reusing the table dataset support for Kafka topics? I think ideally a Kafka topic should be treated as a separate entity, but Kafka also maps nicely onto the traditional table concept :)

feng-tao avatar Jul 14 '20 00:07 feng-tao

also cc @lukelowery, as the team from Reddit has expressed interest in supporting Kafka topics in Amundsen.

feng-tao avatar Jul 14 '20 00:07 feng-tao

Something like http://schema-registry-ui.demo.lenses.io/ or https://www.apicur.io/registry would be super powerful for an event-driven schema registry. And all the database schemas could be expressed in the schema registry.

wuqunfei avatar Dec 06 '20 22:12 wuqunfei

Hi guys, any update on the progress of this feature?

han8909227 avatar May 27 '22 22:05 han8909227

I don't think it is just a matter of schema registry. There are some caveats that need to be taken into consideration, like:

  • What is the source of the topic? A JDBC table?
  • Who uses the topic? Snowflake? Neo4j?
  • and others...

This could also help reconcile data lineage for workflows that use Kafka to move data between data sources. I guess a good approach would be managing not only the Schema Registry but also the whole Kafka Connect platform, which basically allows you to:

  • push data to a Kafka topic from any arbitrary data source via a module called a Source
  • ingest data from a topic into any data source via a module called a Sink
  • manipulate each topic message via Transforms
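To show why Kafka Connect configs are a useful lineage source, here is an illustrative JDBC source connector config (all values are placeholders, not a working deployment): the connector class, table whitelist, and topic prefix together describe a table-to-topic edge a crawler could extract.

```python
# Illustrative Kafka Connect JDBC source connector config. The combination of
# table.whitelist and topic.prefix determines which topic each table feeds.
jdbc_source_config = {
    "name": "orders-jdbc-source",
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db:5432/shop",
    "table.whitelist": "orders",
    "topic.prefix": "shop.",  # table "orders" lands in topic "shop.orders"
}

# Lineage edge a metadata crawler could derive from this config:
source_table = jdbc_source_config["table.whitelist"]
target_topic = jdbc_source_config["topic.prefix"] + source_table
```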

conker84 avatar Nov 21 '22 16:11 conker84

Why was this issue closed??

+1

erdebee avatar Jan 09 '23 15:01 erdebee

We have changed the way we manage issues, you can read more here: https://amundsenworkspace.slack.com/archives/CGGHH0XSB/p1671130550809429

If you are interested in this issue being tackled, please react with 👍 on the original issue post!

Golodhros avatar Jan 09 '23 18:01 Golodhros