datahub
Spark lineage not recorded in DataHub from AWS Glue
Describe the bug I am evaluating DataHub as a data catalog and lineage tool for the Data Platform, using this article as an example: https://aws.amazon.com/blogs/big-data/part-2-deploy-datahub-using-aws-managed-services-and-ingest-metadata-from-aws-glue-and-amazon-redshift/. After running the Glue job, I can see the Spark DataPipeline and DataTask created in DataHub; however, the lineage doesn't exist. The GMS logs look fine, with no errors. An identical issue was raised previously, https://github.com/datahub-project/datahub/issues/8997, but it was closed without any comments. Another related issue was closed without resolution or explanation: https://github.com/datahub-project/datahub/issues/5724
To Reproduce Steps to reproduce the behavior:
- Follow the instructions in the "Capture data lineage" section of the article linked above
- Run the Glue job
- Open the DataHub UI and go to Platform->Spark
- Go to the DataTask created by the Glue job run and open the Lineage section
Expected behavior Upstream and downstream components are visible in the lineage for the DataTask
Is there any update on this issue?
This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io
Any update on this issue?
This still hasn't been answered, so it can't really be closed.
@denystyshetskyy @vinothdataeng Our latest Spark plugin (0.2.8 or 0.2.9) supports Glue, so please give it a try. On Glue, please set the spark.datahub.stage_metadata_coalescing config parameter to true, because Glue doesn't send an explicit application end event.
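A minimal sketch of how that setting might be passed to a Glue job, assuming Glue accepts several Spark settings chained under a single --conf job parameter (a common convention, not something confirmed in this thread). The helper name glue_conf_value and the GMS URL are hypothetical placeholders; the listener class and spark.datahub.rest.server key are taken from the acryl-spark-lineage configuration as I understand it, so verify them against the plugin version you use.

```python
# Hypothetical helper: not part of DataHub or AWS Glue.
def glue_conf_value(confs):
    """Chain Spark settings into one string for Glue's `--conf` job parameter.

    Assumption: Glue takes a single `--conf` key, so additional settings are
    appended to its value as ' --conf key=value'.
    """
    pairs = [f"{key}={value}" for key, value in confs.items()]
    return " --conf ".join(pairs)


spark_confs = {
    # Listener class of the acryl-spark-lineage plugin (assumed, check your version).
    "spark.extraListeners": "datahub.spark.DatahubSparkListener",
    # Placeholder URL: replace with your own DataHub GMS endpoint.
    "spark.datahub.rest.server": "http://<your-gms-host>:8080",
    # Emit metadata per stage, since Glue sends no application end event.
    "spark.datahub.stage_metadata_coalescing": "true",
}

# Value to use for the Glue job's `--conf` parameter (e.g. in DefaultArguments).
default_arguments = {"--conf": glue_conf_value(spark_confs)}
print(default_arguments["--conf"])
```

The resulting string can be pasted as the value of the --conf job parameter in the Glue console, or passed in the job's default arguments when the job is defined programmatically.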
Hi @treff7es, thanks for your response.
I am a bit confused about which jar to use for Glue jobs (datahub-spark-lineage or acryl-spark-lineage).
With acryl-spark-lineage version 0.2.9, which I assume is what you are referring to, I managed to get the lineage created after the Glue job finishes. The issue I have now is that the resulting Hive dataset table has no schema.
Is there anything else I need to do to make the schema come through?
Also, when I run the Glue ingestion from the DataHub CLI, it creates a new Glue dataset for the same table.
So for the same table I now have one Hive table generated by the Glue job with Spark and one Glue table generated by the Glue ingestion.
@denystyshetskyy:
- By default, the Spark plugin only emits the upstream lineage edge and not the datasets or the schema. Ideally, you should capture datasets and schema with the specific DataHub source (for example, using the MySQL source). If you want the Spark lineage plugin to emit them instead, you can enable that with the following config parameters:
  - --conf "spark.datahub.metadata.dataset.materialize=true" to materialize datasets
  - --conf "spark.datahub.metadata.dataset.experimental_include_schema_metadata=true" to capture the schema from the Spark plugin
- To capture Glue tables as glue and not as hive, you should set the config property:
--conf "spark.datahub.metadata.dataset.hivePlatformAlias=glue"
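For convenience, the three dataset-related settings mentioned in this reply can be collected in one place and rendered as --conf flags. This is only an illustrative sketch: the names DATASET_CONFS and as_conf_flags are hypothetical, and the keys and values are copied from the comment above rather than checked against a specific plugin release.

```python
# Settings quoted in the reply above; keys and values are copied as written there.
DATASET_CONFS = {
    # Materialize datasets, not just the upstream lineage edge.
    "spark.datahub.metadata.dataset.materialize": "true",
    # Capture schema from the Spark plugin (flagged experimental by its name).
    "spark.datahub.metadata.dataset.experimental_include_schema_metadata": "true",
    # Report Hive tables under the glue platform instead of hive.
    "spark.datahub.metadata.dataset.hivePlatformAlias": "glue",
}


def as_conf_flags(confs):
    """Render each setting as a `--conf "key=value"` flag (hypothetical helper)."""
    return [f'--conf "{key}={value}"' for key, value in confs.items()]


for flag in as_conf_flags(DATASET_CONFS):
    print(flag)
```

Keeping the settings in a dictionary makes it easy to toggle one of them (for example, dropping the experimental schema option) without editing a long command line by hand.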