
[Feature Request] Best practices for managing Delta Lake table metadata

Open allenhaozi opened this issue 2 years ago • 9 comments

Feature request

A configuration that automatically synchronizes metadata to an external metadata management tool, such as Apache Atlas, DataHub, or OpenMetadata, when creating and updating Delta Lake tables.

We use S3 to store data

allenhaozi avatar Apr 23 '22 02:04 allenhaozi

Hi @allenhaozi, thanks for filing this issue. Can you tell us:

  • your use case(s)
  • more details on what sort of feature/configuration/API you'd like and how you would integrate it into your existing pipeline

scottsand-db avatar Apr 25 '22 20:04 scottsand-db

I asked the same question in Slack https://delta-users.slack.com/archives/CJ70UCSHM/p1650600105595819

cc @dennyglee

allenhaozi avatar Apr 26 '22 09:04 allenhaozi

Delta Lake is a library, not a service

As an administrator of Delta Lake, I want to know:

  1. How many tables have been created and what is the path of each table
  2. What is the schema of each table

I can get this information by scanning the storage paths and parsing the transaction logs, but which paths to scan could itself be a configuration (see the sketch below).
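For illustration, a minimal sketch of such a scan for a single table, assuming the standard _delta_log layout and that Spark already has s3a access configured; the bucket and table path below are placeholders:

# Sketch: extract the current schema of one Delta table by parsing its transaction log.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, input_file_name

spark = SparkSession.builder.appName("delta-metadata-scan").getOrCreate()

table_path = "s3a://my-bucket/warehouse/events"  # hypothetical table location

# Each commit file under _delta_log/ is newline-delimited JSON; the metaData action
# carries the table id, schemaString, and partitionColumns.
log = (
    spark.read.json(f"{table_path}/_delta_log/*.json")
    .withColumn("commit_file", input_file_name())
)

meta = (
    log.where(col("metaData").isNotNull())
       .select("commit_file", "metaData.schemaString", "metaData.partitionColumns")
       .orderBy("commit_file")  # commit files are zero-padded, so this is version order
       .collect()
)

if meta:
    latest = meta[-1]  # the most recent metaData action reflects the current schema
    print(latest["schemaString"])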

As a Delta Lake user:

  1. I want a centralized place to see the tables that are within the scope of my permissions
  2. View table metadata
  3. Write PySpark scripts based on that metadata

Based on this background, assume that Delta Lake provides an API:

Every time a table is created, updated, or deleted, a message is sent to this interface. The message contains all the metadata for that operation.

import pyspark
from delta import *  # provides configure_spark_with_delta_pip

builder = (
    pyspark.sql.SparkSession.builder.appName("s3-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.sql.catalog.spark_catalog.eventhook", "{your-event-process-service}")  # hypothetical parameter; it does not exist today
)

spark = configure_spark_with_delta_pip(builder).getOrCreate()

This event processing service could be any metadata service that follows a Delta Lake message standard, such as DataHub, OpenMetadata, or Apache Atlas.

Delta Lake would:

  1. Define an event message standard
  2. Send event messages to the registered hook address

The metadata service would:

  1. Receive the messages and persist them according to its own model

The real situation is more complex than this; for example, a failure to send a message could leave the catalog inconsistent with the table. A rough sketch of what such an event and its receiver might look like follows.
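As a purely hypothetical illustration (no such hook exists in Delta Lake today), the event message could carry the operation, table path, version, and schema, and the receiver could be any function that forwards it to the catalog of choice; every field name and the push_to_catalog helper below are made up:

# Hypothetical event payload a commit hook might emit; field names are illustrative only.
import json

event = {
    "operation": "CREATE TABLE",           # or ALTER, DROP, WRITE, ...
    "tablePath": "s3a://my-bucket/warehouse/events",
    "version": 0,                          # Delta table version produced by the commit
    "timestamp": "2022-04-23T02:00:00Z",
    "schemaString": json.dumps({           # schema as Delta stores it in the metaData action
        "type": "struct",
        "fields": [
            {"name": "id", "type": "long", "nullable": False, "metadata": {}},
            {"name": "payload", "type": "string", "nullable": True, "metadata": {}},
        ],
    }),
    "partitionColumns": [],
}

# A receiver would translate this into whatever API the catalog exposes
# (DataHub, OpenMetadata, Atlas, ...); push_to_catalog is only a placeholder.
def push_to_catalog(evt: dict) -> None:
    print(f"register {evt['tablePath']} at version {evt['version']}")

push_to_catalog(event)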

@tcondie-db

allenhaozi avatar Apr 26 '22 09:04 allenhaozi

Although the OpenLineage project (and Marquez as a reference implementation) focuses on data lineage specifically, many data catalogs cater to this aspect as well. I cannot help seeing some similarities here.

They implement the integration with the help of a separate library.

vjraitila avatar Apr 26 '22 10:04 vjraitila

Thanks @allenhaozi - a small clarification: Delta Lake is a framework rather than a library, as it has a lot more moving parts. That said, is this something folks would be up for getting together on to create a design document? From a high level I think it's a great idea, but I wanted to dive deeper. As @vjraitila noted, there are reconciliation issues between the various lineage projects and data catalogs. If so, please let me know and I'd be glad to find some time for us to chat about this.

dennyglee avatar May 03 '22 17:05 dennyglee

Although I did bring up OpenLineage specifically, there are, unfortunately, competing "standards" in the field. In contrast, OpenMetadata tries to take a more holistic approach. Then there is Apache Atlas, etc., not to mention all the commercial vendors in the space.

Therefore, whatever the mechanism is, it would have to be extensible somehow.

Disregarding implementation feasibility, I sort of like the approach @allenhaozi presented: an ability to attach an event trigger to DDL statements and to push the event out, schema attached, even in some Delta-specific format. The event handler/receiver would then have to do the necessary translation to whatever API the catalog implementation provides (see the sketch below). Combined with something like Kafka it would even be extremely scalable, although that is likely overkill.
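A minimal sketch of that handler side, assuming the hypothetical Delta-specific event format above were published to a Kafka topic; the topic name, broker address, and translate_and_push helper are placeholders, not an existing integration:

# Sketch of a catalog-agnostic relay: consume hypothetical Delta DDL events from Kafka
# and translate each one into a call against the chosen catalog's API.
import json
from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "delta-table-events",              # hypothetical topic the event hook publishes to
    bootstrap_servers="kafka:9092",    # placeholder broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def translate_and_push(event: dict) -> None:
    # Placeholder for the catalog-specific translation (DataHub, OpenMetadata, Atlas, ...).
    print(f"{event['operation']} on {event['tablePath']} -> update catalog entry")

for message in consumer:
    translate_and_push(message.value)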

Alternatively, some type of a plugin architecture comes to mind - but would such be possible with DeltaCatalog? Or is there some other place in the pipeline for triggering events when table schemas change without bringing in a dependency on Hive?

Finally, I see lots of catalogs tackling the lineage and table metadata with separate mechanisms anyway. Lineage by scouring through query logs or by hooking to the orchestration layer, and table schemas through Hive - or similar. I personally would be happy even if we came up with a solution only for the latter.

vjraitila avatar May 06 '22 07:05 vjraitila

@vjraitila @allenhaozi Perhaps we can jump on a sync to discuss this or do it async on Slack?

dennyglee avatar May 17 '22 02:05 dennyglee

Hi @dennyglee. We recently implemented our data lake solution based on Delta Lake, and metadata is an important piece of it. Currently we manage metadata by polling: an external component scans the table paths that match our Delta Lake layout rules and then injects the metadata into the metadata component. Polling has several problems, such as delay, resource consumption, and inconsistency. We hope to discover better practices as we go, and if we have new ideas we will share them here.
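For context, a minimal sketch of that kind of polling scan, assuming the tables live under a known S3 prefix and are identified by the presence of a _delta_log/ directory; the bucket, prefix, and the final catalog push are placeholders:

# Sketch: discover Delta tables under an S3 prefix by looking for _delta_log/ directories,
# then hand each discovered path to the metadata component.
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"    # placeholder
prefix = "warehouse/"   # placeholder root the polling job is configured to scan

def discover_delta_tables(bucket, prefix):
    tables = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if "/_delta_log/" in key:
                table_prefix = key.split("/_delta_log/")[0]
                if table_prefix not in tables:
                    tables.append(table_prefix)
    return tables

for table in discover_delta_tables(bucket, prefix):
    # Placeholder: read the table's schema (e.g. as in the earlier sketch) and
    # push it to the metadata component.
    print(f"s3a://{bucket}/{table}")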

allenhaozi avatar Jun 11 '22 09:06 allenhaozi

That's great to hear @allenhaozi - let me know if you have some time to chat about this via the Delta Users Slack as this may also be blog / docs worthy, eh?!

dennyglee avatar Jun 21 '22 03:06 dennyglee