Abnormal version growth in dataFlow objects
Describe the bug:
During ingestion of Airflow DAG/task metadata, the DataHub backend database accumulates a large number of version rows.
This happens when an Airflow DAG has more than one tag.
I assume this is an ordering issue: two consecutive values of the metadata_aspect_v2.version field hold the same tag list in the metadata_aspect_v2.metadata field, but in a different order.
| version | metadata | createdon |
|---|---|---|
| 79 | {"tags":[{"tag":"urn:li:tag:bash"},{"tag":"urn:li:tag:test"}]} | 2022-08-03 16:30:35.595 |
| 78 | {"tags":[{"tag":"urn:li:tag:test"},{"tag":"urn:li:tag:bash"}]} | 2022-08-03 15:50:34.517 |
| 77 | {"tags":[{"tag":"urn:li:tag:bash"},{"tag":"urn:li:tag:test"}]} | 2022-08-03 15:40:22.824 |
| 76 | {"tags":[{"tag":"urn:li:tag:test"},{"tag":"urn:li:tag:bash"}]} | 2022-08-03 15:20:25.205 |
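To make the suspected cause concrete, here is a minimal sketch that compares the `metadata` values from the rows above as sets, ignoring order. The helper name `tag_set` is hypothetical and is not part of DataHub; the rows are copied from the table in this report.

```python
import json


def tag_set(metadata_json: str) -> frozenset:
    """Return the tags of a globalTags aspect as an order-independent set."""
    return frozenset(entry["tag"] for entry in json.loads(metadata_json)["tags"])


# (version, metadata) pairs copied from the report, newest first.
rows = [
    (79, '{"tags":[{"tag":"urn:li:tag:bash"},{"tag":"urn:li:tag:test"}]}'),
    (78, '{"tags":[{"tag":"urn:li:tag:test"},{"tag":"urn:li:tag:bash"}]}'),
    (77, '{"tags":[{"tag":"urn:li:tag:bash"},{"tag":"urn:li:tag:test"}]}'),
    (76, '{"tags":[{"tag":"urn:li:tag:test"},{"tag":"urn:li:tag:bash"}]}'),
]

# A version is redundant when it holds the same tags as the version before it.
redundant = [
    newer_version
    for (newer_version, newer), (_, older) in zip(rows, rows[1:])
    if tag_set(newer) == tag_set(older)
]
print(redundant)  # [79, 78, 77]
```

Every version shown differs from its predecessor only in tag order, which is consistent with the ordering hypothesis.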
To Reproduce:
- Add more than one tag to the Airflow DAG args:
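from pendulum import today  # assumption: today() comes from pendulum; the import is not shown in the original report

dag_args = {
    'description': 'Test DAG for evaluating a BashOperator that runs echo "Hello Airflow!"',
    'concurrency': 1,
    'max_active_runs': 1,
    'start_date': today('UTC').add(days=-1),
    'schedule_interval': '*/10 * * * *',  # every 10 minutes
    'catchup': False,
    'tags': ['bash', 'test'],  # more than one tag triggers the issue
}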
- Execute this DAG several times.
- Run the following query on the DataHub backend database:
SELECT * FROM public.metadata_aspect_v2
WHERE urn = 'urn:li:dataFlow:(<dag name>,<env>)'
  AND aspect = 'globalTags'
ORDER BY createdon DESC;
Expected behavior: A new version row is added only when the DAG's tag list actually changes.
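As an illustration of that expectation, the sketch below builds the tag URNs in a deterministic order and treats a reordered list as unchanged. This is only one possible approach under the ordering assumption; the function names are hypothetical, and it is not taken from DataHub's code or from any specific fix.

```python
from typing import Iterable, List, Optional


def canonical_tag_urns(dag_tags: Iterable[str]) -> List[str]:
    """Build tag URNs in a deterministic (sorted, de-duplicated) order."""
    return [f"urn:li:tag:{name}" for name in sorted(set(dag_tags))]


def needs_new_version(previous_urns: Optional[List[str]], dag_tags: Iterable[str]) -> bool:
    """Hypothetical check: a new row is warranted only if the tag set differs."""
    if previous_urns is None:
        return True  # nothing stored yet, so the first write is always needed
    return sorted(set(previous_urns)) != canonical_tag_urns(dag_tags)


stored = ["urn:li:tag:test", "urn:li:tag:bash"]
print(needs_new_version(stored, ["bash", "test"]))         # False: same tags, different order
print(needs_new_version(stored, ["bash", "test", "etl"]))  # True: a tag was added
```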
The above PR should fix this. Thanks @oleg-ruban for reporting.