
Abnormal version growth in dataFlow objects

Open oleg-ruban opened this issue 2 years ago • 0 comments

Describe the bug: While ingesting Airflow DAG/task metadata, the DataHub backend database accumulates a large number of version rows. This happens when an Airflow DAG has more than one tag. I assume the cause is a sorting issue: two consecutive values of the metadata_aspect_v2.version field hold the same tag list in the metadata_aspect_v2.metadata field, but in a different order.

version     metadata                                                        createdon
79          {"tags":[{"tag":"urn:li:tag:bash"},{"tag":"urn:li:tag:test"}]}  2022-08-03 16:30:35.595
78          {"tags":[{"tag":"urn:li:tag:test"},{"tag":"urn:li:tag:bash"}]}  2022-08-03 15:50:34.517
77          {"tags":[{"tag":"urn:li:tag:bash"},{"tag":"urn:li:tag:test"}]}  2022-08-03 15:40:22.824
76          {"tags":[{"tag":"urn:li:tag:test"},{"tag":"urn:li:tag:bash"}]}  2022-08-03 15:20:25.205
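
To make the ordering hypothesis concrete, here is a minimal Python sketch that compares the four rows above as sets of tags rather than as raw JSON text. The helper names `tag_set` and `redundant_versions` are made up for this illustration and are not part of DataHub.

```python
import json

# (version, metadata) pairs copied from the rows shown above.
rows = [
    (79, '{"tags":[{"tag":"urn:li:tag:bash"},{"tag":"urn:li:tag:test"}]}'),
    (78, '{"tags":[{"tag":"urn:li:tag:test"},{"tag":"urn:li:tag:bash"}]}'),
    (77, '{"tags":[{"tag":"urn:li:tag:bash"},{"tag":"urn:li:tag:test"}]}'),
    (76, '{"tags":[{"tag":"urn:li:tag:test"},{"tag":"urn:li:tag:bash"}]}'),
]


def tag_set(metadata_json):
    """Return the tag URNs of a globalTags payload, ignoring their order."""
    payload = json.loads(metadata_json)
    return frozenset(entry["tag"] for entry in payload.get("tags", []))


def redundant_versions(rows):
    """Hypothetical helper: list versions whose tag set equals the previous version's."""
    ordered = sorted(rows)  # ascending by version number
    redundant = []
    for (_, prev_meta), (version, meta) in zip(ordered, ordered[1:]):
        if tag_set(prev_meta) == tag_set(meta):
            redundant.append(version)
    return redundant


print(redundant_versions(rows))  # [77, 78, 79]
```

Every version after 76 carries the same two tags, so under this comparison none of them represents a real change.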

To Reproduce:

  • Add more than one tag to the Airflow DAG args
from pendulum import today  # assumed import; the report does not show where today() comes from

dag_args = {
    'description': 'Test DAG for evaluation BashOperator that run echo "Hello Airflow!"',
    'concurrency': 1,
    'max_active_runs': 1,
    'start_date': today('UTC').add(days=-1),
    'schedule_interval': '*/10 * * * *',  # every 10 minutes
    'catchup': False,
    'tags': ['bash', 'test'],  # two tags are enough to trigger the issue
}

  • Run this DAG several times

  • Run the following query on the DataHub backend database:

 select * from public.metadata_aspect_v2
    WHERE urn = 'urn:li:dataFlow:(<dag name>,<env>)'
    AND aspect = 'globalTags'
    ORDER BY createdon DESC 

Expected behavior: A new version row is added only when the DAG's tag list actually changes.
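
One way to get this behavior, sketched here under the assumption that the tag order carries no meaning, is to sort the tags before building the payload so that the same set always serializes identically. The functions `normalized_tags` and `global_tags_payload` are hypothetical names for this example; this is not necessarily how the DataHub Airflow integration handles it.

```python
import json


def normalized_tags(tags):
    """Deduplicate and sort tag names so their order is deterministic."""
    return sorted(set(tags))


def global_tags_payload(tags):
    """Hypothetical helper: build a globalTags-shaped dict from plain tag names."""
    return {"tags": [{"tag": f"urn:li:tag:{name}"} for name in normalized_tags(tags)]}


first_run = json.dumps(global_tags_payload(["bash", "test"]))
second_run = json.dumps(global_tags_payload(["test", "bash"]))

# Both runs now serialize to the same text, so a change check based on
# payload equality would not see a difference between them.
print(first_run == second_run)  # True
```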

oleg-ruban, Aug 03 '22 15:08

The above PR should fix this. Thanks, @oleg-ruban, for reporting.

treff7es, Aug 22 '22 11:08