amundsen
Apache Kafka integration
Any ideas on how to integrate with Apache Kafka topics?
@polya20 we haven't added support for Kafka topics yet. But per my understanding of Kafka, if you have a schema registry deployed on your end, you could pull topic metadata from there and persist it in Amundsen. You would also need to build the UI for that.
Not sure schema-registry will be a valid source, since in the latest versions schemas are no longer uniquely linked to one topic: https://docs.confluent.io/current/schema-registry/serializer-formatter.html#sr-avro-subject-name-strategy.
I'd prefer a kafka-topic-metadata-extractor that uses the Admin client to gather topic metadata.
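As a rough illustration, an extractor like that could call the Admin client and map the result into flat records. This is only a sketch under assumptions: the names `TopicRecord`, `to_records`, and `extract_from_cluster` are hypothetical, not part of Amundsen's databuilder; the broker call uses the `confluent-kafka` package's `AdminClient.list_topics`.

```python
# Hypothetical sketch of a kafka-topic-metadata-extractor.
# Assumes the confluent-kafka client; names here are illustrative.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TopicRecord:
    name: str
    partition_count: int
    replication_factor: int


def to_records(topics: Dict[str, Dict]) -> List[TopicRecord]:
    """Convert raw topic metadata (name -> {partitions, replication}) into records."""
    return [
        TopicRecord(
            name=name,
            partition_count=meta["partitions"],
            replication_factor=meta["replication"],
        )
        for name, meta in sorted(topics.items())
    ]


def extract_from_cluster(bootstrap_servers: str) -> List[TopicRecord]:
    """Gather topic metadata from a live cluster (requires a running broker)."""
    from confluent_kafka.admin import AdminClient  # real confluent-kafka API

    admin = AdminClient({"bootstrap.servers": bootstrap_servers})
    cluster = admin.list_topics(timeout=10)
    raw = {
        name: {
            "partitions": len(t.partitions),
            # replication factor read from the first partition's replica list
            "replication": len(next(iter(t.partitions.values())).replicas)
            if t.partitions
            else 0,
        }
        for name, t in cluster.topics.items()
    }
    return to_records(raw)
```

The broker call is isolated in `extract_from_cluster` so the record mapping can be exercised without a cluster.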
Would this work with the current data model, mapping a Kafka topic as a table? Or is stream entity support coming in later versions?
@jeqo I think it needs a new model for Kafka topics.
What would the Kafka topic model look like?
Here is a list of metadata that pops into my head:
- topics
- schemas
- offset info
- partitions
- consumer & producer info
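To make the list above concrete, the metadata could be grouped into something like the dataclass below. This is purely a sketch: no such model exists in Amundsen, and every field name is an assumption.

```python
# Hypothetical shape of a Kafka topic model, based on the metadata list
# above. All field names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class KafkaTopicMetadata:
    topic: str
    schema: Optional[str] = None  # e.g. an Avro schema string, if known
    partitions: int = 1
    # (earliest, latest) offset per partition, analogous to a table's date range
    offset_range: Dict[int, Tuple[int, int]] = field(default_factory=dict)
    consumers: List[str] = field(default_factory=list)  # consumer group ids
    producers: List[str] = field(default_factory=list)  # producing services, if tracked
```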
And here are some ideas for the frontend and databuilder.
Frontend
- Instead of date range used for hive partitions, maybe frontend can show offset range and such for Kafka topics.
- Replace `table-users` with Kafka consumers and `table-owners` with producers?
Databuilder
The databuilder will periodically gather metadata via the Kafka Admin client. If we don't use schema-registry, a topic's schema will have to be inferred from the latest offset. If the data is structured with Protobuf or Avro, inferring the schema won't be too hard. But if some Kafka topics use unstructured data like text or CSV, we might have to pull the schema from schema-registry.
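For the structured case, inferring a schema from a decoded message could look like the sketch below. Assumptions: the payload is JSON (not Protobuf/Avro as mentioned above), only top-level fields are inspected, and `infer_schema` plus its type-name mapping are illustrative.

```python
# Sketch: infer a topic's schema from a single decoded message, for the
# case where no schema registry is available. JSON-only, illustrative.
import json
from typing import Dict


def infer_schema(payload: bytes) -> Dict[str, str]:
    """Map each top-level field of a JSON message to a primitive type name."""
    record = json.loads(payload)
    if not isinstance(record, dict):
        raise ValueError("cannot infer a schema from a non-object payload")
    # bool is checked via exact type(), so True does not match int
    type_names = {str: "string", int: "long", float: "double", bool: "boolean"}
    return {key: type_names.get(type(value), "unknown") for key, value in record.items()}
```

In a real databuilder job this would run on the message fetched from the latest offset; here it only shows the inference step.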
@jacobhjkim are you suggesting we reuse the table dataset support for Kafka topics? I think ideally a Kafka topic should be treated as a separate entity, but Kafka also maps nicely onto the traditional table concept :)
Also cc @lukelowery, as the team from Reddit expressed interest in supporting Kafka topics in Amundsen.
Like this http://schema-registry-ui.demo.lenses.io/ or https://www.apicur.io/registry. It would be super powerful for event-driven schema registries, and all the database schemas could be expressed in a schema registry.
Hi guys, any update on the progress of this feature?
I don't think it is just a matter of the schema registry. There are some caveats that need to be taken into consideration, like:
- What is the source of the topic? A JDBC table?
- Who uses the topic? Snowflake? Neo4j?
- and others...

This could also help reconcile data lineage for workflows that use Kafka to move data between data sources. I guess a good approach would be managing not only the Schema Registry, but also the whole Kafka Connect platform, which basically allows you to:
- push data to a Kafka topic from any arbitrary data source via a module called `Source`
- ingest data from a topic into any data source via a module called `Sink`
- manipulate each topic message via Transforms
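Kafka Connect does expose its connectors over a REST API (`GET /connectors`, `GET /connectors/{name}`), so lineage edges could in principle be derived from connector configs. The sketch below assumes a connector description with a `type` field and a `config` dict; the `lineage_edges` function and the `(source, target)` edge format are purely illustrative, and the topic-related config keys vary by connector plugin.

```python
# Sketch: derive lineage edges from a Kafka Connect connector description.
# The edge representation and function name are illustrative assumptions.
from typing import Dict, List, Tuple


def lineage_edges(name: str, info: Dict) -> List[Tuple[str, str]]:
    """Return (source, target) edges for one connector description.

    For a sink connector the edge is topic -> connector; for a source
    connector it is connector -> topic (when a 'topic' config key is
    present, which depends on the specific connector plugin).
    """
    config = info.get("config", {})
    if info.get("type") == "sink":
        # Sink connectors list their input topics in the 'topics' config
        topics = [t.strip() for t in config.get("topics", "").split(",") if t.strip()]
        return [(topic, name) for topic in topics]
    topic = config.get("topic")
    return [(name, topic)] if topic else []
```

Feeding every connector's description through this would give a first-cut graph of how data flows through Kafka between external systems.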
Why was this issue closed??
+1
We have changed the way we manage issues, you can read more here: https://amundsenworkspace.slack.com/archives/CGGHH0XSB/p1671130550809429
If you are interested in this issue being tackled, please react with 👍 on the original issue post!