Adding Airflow operator to create and monitor Spark Apps/App Envs on DPGDC clusters
GDC extends Google Cloud’s infrastructure and services to customer edge locations and data centers. We intend to bring Dataproc as a managed service offering to GDC.
As part of this, we need to integrate Airflow with Dataproc on GDC API resources to trigger Apache Spark workloads. The integration needs to cover two distinct API paths: local execution through the KRM API and remote execution through an equivalent One Platform API. This PR targets the former, i.e. an Airflow operator leveraging the KRM API.
This PR contains two feature implementations:
- An Airflow operator that can submit a SparkApplication via the KRM API, enabling customers to use Airflow to orchestrate Dataproc on GDC workloads locally (a usage sketch follows below).
- An Airflow operator that can create an ApplicationEnvironment via the KRM API.
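To make the KRM path concrete, here is a minimal sketch of what submitting a SparkApplication custom resource amounts to, using the plain Kubernetes Python client rather than the operator added in this PR. The CRD group/version comes from the review discussion below; the resource plural and the fields under `spec` are illustrative assumptions, not the PR's actual schema.

```python
# Minimal sketch (not the operator from this PR): creating the SparkApplication
# custom resource directly through the Kubernetes (KRM) API.
# The plural "sparkapplications" and the spec payload are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running on the cluster
api = client.CustomObjectsApi()

spark_app = {
    "apiVersion": "dataprocgdc.cloud.google.com/v1alpha1",
    "kind": "SparkApplication",
    "metadata": {"name": "spark-pi", "namespace": "default"},
    "spec": {
        # Illustrative payload only; the real DPGDC spec fields may differ.
        "sparkApplicationConfig": {
            "mainClass": "org.apache.spark.examples.SparkPi",
            "jarFileUris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        },
    },
}

created = api.create_namespaced_custom_object(
    group="dataprocgdc.cloud.google.com",
    version="v1alpha1",
    namespace="default",
    plural="sparkapplications",  # assumed plural for the CRD
    body=spark_app,
)
print(f"Created {created['kind']} {created['metadata']['name']}")
```

In practice, the operator would presumably wrap a call like this behind Airflow's Kubernetes hook/connection handling and add monitoring on top.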
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst). Here are some useful points:
- Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
- In case of a new feature add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
- Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
- Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
- Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
- Be sure to read the Airflow Coding style.
- Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.

Apache Airflow is a community-driven project and together we are making it better 🚀. In case of doubts contact the developers at: Mailing List: [email protected] Slack: https://s.apache.org/airflow-slack
I didn't find anything about this in the GCP documentation; could you please add a documentation link?
Is the CRD `dataprocgdc.cloud.google.com/v1alpha1` based on `sparkoperator.k8s.io/v1beta2`? I'm asking because we already have two operators for the `spark-on-k8s-operator`, so maybe we can use one of them as a superclass for your operator to avoid code duplication and implementing everything from scratch.
> Hey Hussein, Dataproc on GDC is yet to go GA, and this is one of the critical features to add as part of it; hence we don't have a public doc yet. You can take a look at this blog post though: https://cloud.google.com/blog/products/infrastructure-modernization/google-distributed-cloud-new-ai-and-data-services

It will be a bit complicated to review this PR without a doc; I will try to review the syntax and check whether our conventions are respected. (cc: @eladkal could you take a look?)

> The CRD for DPGDC is completely different from `sparkoperator.k8s.io/v1beta2`, which stops us from leveraging the same operator/sensor. Using the KRM APIs is one of the ways to interact with a DPGDC cluster. In the future, we plan to add an operator similar to those in dataproc.py (e.g. DataprocSubmitSparkJobOperator), which will leverage Google's internal One Platform API mechanism.

No problem with that; my goal was to reduce code duplication if possible, but since they are different, we can keep building your operator from scratch.
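To illustrate what monitoring over the KRM path looks like (independent of which operator/sensor base class ends up being used), a sensor's poke would boil down to reading the custom resource's status. This is a sketch only: the `.status.state` field and its terminal values are assumptions about the DPGDC CRD, not something confirmed in this thread.

```python
# Sketch of a KRM-based status check, roughly what a sensor's poke() would do.
# The ".status.state" field and terminal values are assumptions about the DPGDC CRD.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

app = api.get_namespaced_custom_object(
    group="dataprocgdc.cloud.google.com",
    version="v1alpha1",
    namespace="default",
    plural="sparkapplications",  # assumed plural
    name="spark-pi",
)

state = app.get("status", {}).get("state")  # assumed status field
print(f"SparkApplication state: {state}")
# A sensor would return True once the state reaches a terminal value (e.g. SUCCEEDED/FAILED).
```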
Hi @akashsriv07,
I will propose a different approach for the folder/module structure. Since GDC might have different APIs, or different inner working mechanisms, it can be good to separate the GDC-related things (like hooks, operators, sensors) from the cloud module. In the future, if you want to add more things (like hooks, operators, sensors), it can be easily extended in a more structured way. Also, you will still have access to the existing Google Cloud operators, sensors, etc.
Hence, I am suggesting creating a new module inside the google provider (at the same level as the cloud module). Something like the following:
airflow/providers/google
├── ads
├── cloud
├── common
├── firebase
├── gdc # <-- this is the new one
├── leveldb
├── marketing_platform
├── suite
├── CHANGELOG.rst
├── go_module_utils.py
├── __init__.py
└── provider.yaml
What do you think?
Hey @molcay, GDC is part of Google Cloud itself. Shouldn't it be a part of it?
Hey @akashsriv07,
When I read the following sentence on the product page, I thought that this product is not directly part of GCP; rather, it is the equivalent of GCP for on-premise infrastructure:
> Build, deploy, and scale modern industry & public sector applications on-premise with AI ready modern infrastructure, ensure data security, and enable agile developer workflows across edge locations & data centers.
But of course, I am not entirely sure on the topic; this suggestion falls into a gray area, I guess. On the other hand, as a baseline, maybe we can group the changes related to GDC in a single module and put it inside the cloud module:
airflow/providers/google/cloud
├── example_dags
├── fs
├── gdc # <-- this is the new one
├── hooks
├── __init__.py
├── _internal_client
├── links
├── log
├── operators
├── __pycache__
├── secrets
├── sensors
├── transfers
├── triggers
└── utils
Also, maybe it is too early to have this module structure right now. We can also wait and see how it extends.
Hey, Michal from Cloud Composer team here. +1 for this code living in a separate GDC provider: GDC is based on slightly different infra than GCP, and the operators won't necessarily work in the same way (some might be very similar, some very different). If there's shared code, I propose having a util module/provider between these two. FWIW, we plan to separate out more Google providers (e.g. taking out ads into a separate one) to decouple support of Google services a bit, instead of having them all in one provider, which at times unnecessarily tangles dependencies, for example.
> Hey, Michal from Cloud Composer team here. +1 for this code living in a separate GDC provider: GDC is based on slightly different infra than GCP, and the operators won't necessarily work in the same way (some might be very similar, some very different).

Given the above statement and the fact that we are going to separate google into several providers, we should treat this PR like adding a new provider, which means it needs to go through the approval cycle like any other new provider. https://github.com/apache/airflow/blob/main/PROVIDERS.rst#accepting-new-community-providers

> If there's shared code, I propose having a util module/provider between these two.

That will not work, and this is one of the reasons why separating the Google provider is not an easy task. You can't have utils shared between two providers; utils belong to one provider. Maybe Google can introduce a google.common provider, but that would require a separate discussion. I really advise against adding more to the Google provider; it will make things harder for the decoupling project.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.
Setting request changes to avoid accidental merge.
Will explore the feasibility of creating the provider and the process to be followed. Thanks!
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.