Make Airflow plugin fully compatible with Airflow >=2.7

Open Linux-oiD opened this issue 7 months ago • 44 comments

Modern Airflow versions have moved OpenLineage support into the OpenLineage provider. According to the requirements at https://github.com/datahub-project/datahub/blob/cbe0334fb002253d8f366c32eb88db57af0e6baf/metadata-ingestion-modules/airflow-plugin/setup.py#L47, the current Airflow plugin V2 still relies on the old openlineage plugin. This triggers the following deprecation warning in the task logs:

For Airflow 2.7 and later, use the native Airflow Openlineage provider package. Documentation can be found at https://airflow.apache.org/docs/apache-airflow-providers-openlineage

DataHub version: 1.0.0
Airflow version: 2.9.2
Airflow plugin version: acryl-datahub-airflow-plugin==1.0.0.2

Linux-oiD avatar Apr 29 '25 13:04 Linux-oiD

Worth adding that Airflow 3 is already out

ms32035 avatar May 01 '25 14:05 ms32035

May I take this issue and submit a PR?

harishkesavarao avatar May 04 '25 15:05 harishkesavarao

@harishkesavarao go for it

Note that you might end up with complex dependency issues. I'm worried that we may need to drop support for Airflow <2.7 in order to make this work, although that might be acceptable since those versions see pretty low usage at this point.

hsheth2 avatar May 06 '25 18:05 hsheth2

Sounds good @hsheth2. Agree with the dependency issues, and with dropping support for Airflow < 2.7. I plan on testing as extensively as possible once the change is made. Is there anything else that you recommend in addition to it?

harishkesavarao avatar May 07 '25 01:05 harishkesavarao

@harishkesavarao it might make sense to do two separate PRs - one dropping support, and the other improving compat with Airflow 2.7+

hsheth2 avatar May 08 '25 20:05 hsheth2

Sounds good, let me open the PRs and get back. Thanks for the comment!

harishkesavarao avatar May 11 '25 15:05 harishkesavarao

Sorry for the delay, will pick this up shortly. Edit: Working on it actively.

harishkesavarao avatar May 22 '25 14:05 harishkesavarao

closing this since https://github.com/datahub-project/datahub/issues/13357 is merged, thanks!

yoonhyejin avatar Jul 07 '25 12:07 yoonhyejin

@yoonhyejin hello. It looks like your comment links to this issue itself. Also, if I understand @harishkesavarao correctly, https://github.com/datahub-project/datahub/pull/13619 is only the first part of the work to resolve this issue. There were no actual changes in the plugin codebase, only a bump of the minimum supported Airflow version.

Linux-oiD avatar Jul 07 '25 14:07 Linux-oiD

That’s correct - this is still open.

hsheth2 avatar Jul 07 '25 14:07 hsheth2

@Linux-oiD @hsheth2 @ms32035 Would the second part of the fix be to change "openlineage-airflow>=1.2.0,<=1.25.0" to "openlineage-airflow>=1.7.1,<=2.5.0" here? Or should we keep a more restrictive upper version limit?

harishkesavarao avatar Jul 07 '25 17:07 harishkesavarao

@harishkesavarao the next step would be to remove the dep on openlineage-airflow and replace it with apache-airflow-providers-openlineage https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html
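
As an illustration of the dependency swap described above, here is a minimal sketch (not taken from the plugin codebase) that reports which of the two distributions is installed in an environment. It uses only the standard library's `importlib.metadata`; the helper names `installed_version` and `lineage_backend` are hypothetical.

```python
from importlib import metadata
from typing import Optional

# Distribution names as they appear in the thread.
LEGACY_DIST = "openlineage-airflow"
PROVIDER_DIST = "apache-airflow-providers-openlineage"


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def lineage_backend() -> str:
    """Hypothetical helper: name the OpenLineage integration that is available."""
    if installed_version(PROVIDER_DIST) is not None:
        return "provider"
    if installed_version(LEGACY_DIST) is not None:
        return "legacy"
    return "none"


if __name__ == "__main__":
    print(lineage_backend())
```

A check like this could help when verifying that an environment no longer pulls in the legacy package after the switch.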

hsheth2 avatar Jul 07 '25 17:07 hsheth2

Thank you @hsheth2, I will work on a PR in the coming days. Are there any items to watch out for (dependencies, tests as well as the approach)?

harishkesavarao avatar Jul 09 '25 17:07 harishkesavarao

@harishkesavarao the main thing to watch out for is to ensure that all of the tests continue to work, and that there are no changes to the metadata produced by the integration.

hsheth2 avatar Jul 09 '25 17:07 hsheth2

@harishkesavarao @hsheth2 are there still plans to roll out a change for this soon? The OpenLineage dependencies block us from migrating to Airflow 3, and we would like the DataHub plugins to work.

yuefeng-zhu avatar Jul 16 '25 15:07 yuefeng-zhu

@yuefeng-zhu, sorry about the delay.

I am working on it and will send out a PR soon. What timeline are we looking at?

harishkesavarao avatar Jul 21 '25 12:07 harishkesavarao

@yuefeng-zhu, sorry about the delay.

I am working on it and will send out a PR soon. What timeline are we looking at?

Ideally within the next 2 weeks! There is a lot of need for Airflow 3 features, including event-based capabilities. We also need DataHub for metadata purposes for our data! Thank you very much @harishkesavarao.

yuefeng-zhu avatar Jul 21 '25 19:07 yuefeng-zhu

@harishkesavarao the next step would be to remove the dep on openlineage-airflow and replace it with apache-airflow-providers-openlineage https://airflow.apache.org/docs/apache-airflow-providers-openlineage/stable/index.html

@hsheth2 As I am implementing the changes, I want to clarify a couple items, since you have done most of the implementation for the existing plugin:

  • I have changed "openlineage-airflow>=1.2.0,<=1.30.1" to "apache-airflow-providers-openlineage>=1.1.0,<2.5.0" in setup.py.
  • I see that the old plugin is imported in _extractors.py and datahub_listener.py. I believe these imports need to switch to the apache-airflow-providers-openlineage package. Can you please confirm whether this looks like the right direction?

Specifically,

from openlineage.airflow.listener import TaskHolder
from openlineage.airflow.utils import redact_with_exclusions
from openlineage.client.serde import Serde

and

from openlineage.airflow.extractors import (
    BaseExtractor,
    ExtractorManager as OLExtractorManager,
    TaskMetadata,
)
from openlineage.airflow.extractors.snowflake_extractor import SnowflakeExtractor
from openlineage.airflow.extractors.sql_extractor import SqlExtractor
from openlineage.airflow.utils import get_operator_class, try_import_from_string
from openlineage.client.facet import (
    ExtractionError,
    ExtractionErrorRunFacet,
    SqlJobFacet,
)

harishkesavarao avatar Jul 27 '25 17:07 harishkesavarao

Yup that looks about right. The other thing to be careful of is to make sure everything ports over properly and all the tests continue to work as expected.

hsheth2 avatar Jul 27 '25 22:07 hsheth2

@hsheth2 a question on the datahub_listener, I may be completely off track here, but looking at the apache-airflow-providers-openlineage plugin, my understanding is that we do not have to keep track of tasks within the datahub_listener. I am looking at the apache_airflow_providers_openlineage-2.5.0 code and it looks like we can just call the methods within the OpenLineageListener class.

@yuefeng-zhu just letting you know that I am working through this and will try and make as much progress as possible at the earliest.

harishkesavarao avatar Jul 30 '25 17:07 harishkesavarao

Not quite - our listener class is very similar to the OpenLineageListener class, but does a bit more. If you look through the code you'll see the differences (different extractor, format messages for DataHub instead of OL, etc.). Those differences / additional capabilities are important to preserve, but we should be integrating the other OL changes into our listener. However, you are probably correct that we can remove the TaskHolder.

hsheth2 avatar Jul 30 '25 20:07 hsheth2

@harishkesavarao @hsheth2 do we have an ETA on this change?

yuefeng-zhu avatar Aug 08 '25 19:08 yuefeng-zhu

@yuefeng-zhu, not at the moment. I am actively working on it. The changes are quite extensive, given we are handling the listener for Airflow 3 and beyond but also preserving backward compatibility. I understand that you are waiting for the change and I will keep you posted.

You can follow the updates here: https://github.com/harishkesavarao/datahub/blob/fix/airflow-plugin/metadata-ingestion-modules/airflow-plugin/src/datahub_airflow_plugin/datahub_listener.py

harishkesavarao avatar Aug 09 '25 01:08 harishkesavarao

@yuefeng-zhu I will try my best to submit a PR by the weekend.

harishkesavarao avatar Aug 14 '25 18:08 harishkesavarao

@hsheth2 I believe the conversion to the Airflow plugin for openlineage presents an opportunity to use the Airflow extractors as well. The Airflow extractor manager does not have child classes for Snowflake, BigQuery or any custom extractor.

What are your thoughts on implementing them in the datahub extractor?

Options:

  1. Continue to use the OL extractors for specific operators while moving away from OL to Airflow extractors for generic use.
  2. Write wrappers for Snowflake and BigQuery to allow custom extractors to use the Airflow extractors (this is a larger effort, potentially the long-term option to completely move away from the OL extractor)

harishkesavarao avatar Aug 15 '25 19:08 harishkesavarao

Let's keep it simple - we can continue inheriting from and extending openlineage's ExtractorManager

All of our imports should probably change from openlineage.airflow.<something> to airflow.providers.openlineage.<something>
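
A minimal sketch of what that rename could look like in practice, assuming a purely mechanical prefix swap (the helper names `provider_path` and `import_first_available` are hypothetical and not part of either library). Whether each submodule actually exists under the new prefix has to be checked against the installed provider version; modules outside `openlineage.airflow`, such as `openlineage.client.serde`, are left untouched.

```python
import importlib
from types import ModuleType
from typing import Sequence

LEGACY_PREFIX = "openlineage.airflow"
PROVIDER_PREFIX = "airflow.providers.openlineage"


def provider_path(legacy_path: str) -> str:
    """Map 'openlineage.airflow.<something>' to 'airflow.providers.openlineage.<something>'."""
    if legacy_path == LEGACY_PREFIX or legacy_path.startswith(LEGACY_PREFIX + "."):
        return PROVIDER_PREFIX + legacy_path[len(LEGACY_PREFIX):]
    # Anything else (for example openlineage.client.*) keeps its path.
    return legacy_path


def import_first_available(candidates: Sequence[str]) -> ModuleType:
    """Import and return the first module in `candidates` that is importable."""
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            errors.append(f"{name}: {exc}")
    raise ImportError("no candidate could be imported: " + "; ".join(errors))


if __name__ == "__main__":
    legacy = "openlineage.airflow.extractors"
    # Prefer the provider module, fall back to the legacy one if it is missing.
    print([provider_path(legacy), legacy])
```

Trying the provider path first and the legacy path second is one way to keep a transition period workable, though the thread itself leans toward a clean switch after dropping Airflow <2.7.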

hsheth2 avatar Aug 18 '25 18:08 hsheth2

@yuefeng-zhu I will continue to work on this and will finalize the PR. I have been short on time personally, apologies.

harishkesavarao avatar Aug 19 '25 15:08 harishkesavarao

I'm eager to get this patch. Let me know if I can help in any way.

sorenarchibald avatar Aug 22 '25 19:08 sorenarchibald

@sorenarchibald, thank you for offering. I would appreciate that. In particular, I need some help with testing the refactoring of the extractors to use this instead of this.

harishkesavarao avatar Aug 22 '25 23:08 harishkesavarao

@sorenarchibald @yuefeng-zhu @ne1r0n I am working on this actively and making progress, just wanted to give an update.

harishkesavarao avatar Aug 26 '25 11:08 harishkesavarao