Deltalake without metastore
We currently use PySpark with Delta support to connect to an external metastore and extract metadata about the Delta tables.
We could expand the support to fetch tables directly from S3. In that case, we could rely on https://pypi.org/project/deltalake/.
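A minimal sketch of what that could look like with the deltalake package, assuming the table URI, bucket, and credential values below are placeholders:

from deltalake import DeltaTable

storage_options = {
    "AWS_ACCESS_KEY_ID": "<access-key-id>",
    "AWS_SECRET_ACCESS_KEY": "<secret-access-key>",
    "AWS_REGION": "<region>",
}

# Open the table directly from its S3 path; no Spark session or metastore involved.
dt = DeltaTable("s3://my-bucket/path/to/table", storage_options=storage_options)

print(dt.schema())    # column names and types
print(dt.metadata())  # table id, partition columns, configuration
print(dt.version())   # current snapshot version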
Moreover, we should double-check how we are listing & filtering tables in the connector so that we only fetch Delta tables.
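One possible way to do that filtering, assuming the bucket and prefix names are placeholders: treat any S3 prefix that contains a _delta_log/ folder as the root of a Delta table and skip everything else, for example with boto3:

import boto3

def list_delta_table_prefixes(bucket: str, prefix: str = "") -> list:
    """Return the S3 prefixes under `prefix` that look like Delta table roots."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    delta_tables = set()
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # A key containing a _delta_log segment belongs to a Delta table
            # rooted at the part of the key before that segment.
            if "/_delta_log/" in key:
                delta_tables.add(key.split("/_delta_log/")[0])
    return sorted(delta_tables)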
@pmbrull let's assign this to someone on the team.
Actually, you don't need the deltalake package; you can use delta-spark for this too. I already made an example based on your code, which would keep the connector code a lot cleaner. I'm not sure whether it will work, because my research into your code didn't go deep enough, but it gives the general idea. The deltalake package has certain limitations that you might want to take into consideration, so I would only use it if you want a connector that can run without Spark.
if isinstance(connection.metastoreConnection, S3Config):
    # Pass the S3 credentials and endpoint straight to the S3A filesystem Spark uses.
    if connection.metastoreConnection.awsAccessKeyId:
        builder.config(
            "spark.hadoop.fs.s3a.access.key",
            connection.metastoreConnection.awsAccessKeyId,
        )
    if connection.metastoreConnection.awsSecretAccessKey:
        builder.config(
            "spark.hadoop.fs.s3a.secret.key",
            connection.metastoreConnection.awsSecretAccessKey,
        )
    if connection.metastoreConnection.endPointURL:
        builder.config(
            "spark.hadoop.fs.s3a.endpoint",
            connection.metastoreConnection.endPointURL,
        )
Anyway, this would be a useful addition, since it is common practice to store Delta Lake data with S3 as the underlying storage.
For more details on this, and on other storage platforms as well: https://docs.delta.io/latest/delta-storage.html