Extend Databricks Operators to DBFS Interaction
Description
Create operators and a hook to interact with the Databricks DBFS API (https://docs.databricks.com/api/workspace/dbfs)
Use case/motivation
As of the latest Databricks plugin (https://github.com/apache/airflow/tree/main/airflow/providers/databricks), there is no way to interact with the DBFS API.
Since I had to build this for my job (and it is already fairly well developed), I thought it'd be a good idea to share it with the community.
So far, I've got:
- An operator that uploads files to DBFS
- A hook that interacts with the DBFS API, following the logic of the existing Databricks hooks and inheriting from BaseDatabricksHook
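To make the upload flow concrete, here is a minimal sketch of how a file could be streamed to DBFS through the `create`, `add-block` and `close` endpoints. It is not the hook from this issue: `upload_to_dbfs` and the `post` callable are hypothetical stand-ins for whatever method the hook uses to call the REST API, and the block size is an assumption chosen to stay below the documented 1 MB limit per block.

```python
import base64
from typing import Any, Callable, Dict

# Hypothetical signature: post(endpoint, json_payload) -> parsed JSON response.
PostFn = Callable[[str, Dict[str, Any]], Dict[str, Any]]

# Assumed chunk size, kept below the 1 MB add-block limit.
BLOCK_SIZE = 512 * 1024


def upload_to_dbfs(
    post: PostFn,
    data: bytes,
    dbfs_path: str,
    overwrite: bool = False,
    block_size: int = BLOCK_SIZE,
) -> int:
    """Stream `data` to `dbfs_path` and return the number of blocks sent."""
    # Open a write handle on the target path.
    response = post("api/2.0/dbfs/create", {"path": dbfs_path, "overwrite": overwrite})
    handle = response["handle"]
    blocks = 0
    try:
        for start in range(0, len(data), block_size):
            chunk = data[start:start + block_size]
            # The API expects each block as a base64-encoded string.
            encoded = base64.b64encode(chunk).decode("ascii")
            post("api/2.0/dbfs/add-block", {"handle": handle, "data": encoded})
            blocks += 1
    finally:
        # Always release the handle, even if a block fails.
        post("api/2.0/dbfs/close", {"handle": handle})
    return blocks
```

Passing the HTTP call in as a function keeps the chunking logic testable without a workspace; in a real hook it would presumably be a method that reuses the connection and auth handling of BaseDatabricksHook.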
As part of the PR, I'd add:
- Some more operators (getting files, getting file metadata, deleting files)
- Tests in line with Airflow's test suite
Please LMK whether you consider this a relevant contribution.
Related issues
As one of the DBFS API endpoints uses PUT as its HTTP verb, I'd need to include a modification in BaseDatabricksHook, because it does not support PUT ATM (see https://github.com/apache/airflow/blob/main/airflow/providers/databricks/hooks/databricks_base.py#L584)
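As an illustration of the kind of change meant here, the sketch below validates the HTTP verb against an allow-list that includes PUT before building the request. It is not the code in `databricks_base.py`: `SUPPORTED_METHODS` and `build_request` are hypothetical names, and the set of already supported verbs is an assumption.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

# Assumed allow-list; PUT is the addition discussed above.
SUPPORTED_METHODS = frozenset({"GET", "POST", "PATCH", "DELETE", "PUT"})


def build_request(
    method: str,
    url: str,
    payload: Optional[Dict[str, Any]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a JSON request, rejecting unsupported verbs."""
    verb = method.upper()
    if verb not in SUPPORTED_METHODS:
        raise ValueError(f"Unexpected HTTP method: {method}")
    # GET requests carry no body in this sketch; other verbs send JSON.
    body = None
    headers = {}
    if verb != "GET" and payload is not None:
        body = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=body, headers=headers, method=verb)
```

With an allow-list like this, supporting a new verb is a one-line change, and unknown verbs still fail early instead of reaching the API.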
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Maybe it is also a good idea to implement DBFS over Object Storage.
And just wondering: why not implement it on top of the official SDK?
Note about production usage from https://docs.databricks.com/en/dev-tools/sdk-python.html
[!NOTE] This feature is in Beta and is okay to use in production.
During the Beta period, Databricks recommends that you pin a dependency on the specific minor version of the Databricks SDK for Python that your code depends on.
Hi @Taragolis. Thx for your reply! TBH, I wasn't aware of the existence of Object Storage. It seems that many of the things I've implemented were already there. The only thing I cannot find is some sort of cp that enables uploading/downloading data to/from DBFS. At this point I wonder if it wouldn't be better to extend this and then use it within ObjectStoragePath.
With respect to the SDK, sounds good to me. However, the whole plugin is done pointing directly to the REST Endpoints. I think it may be better in that sense to stick to one strategy (either change everything to point to the SDK or extend it using the REST API)
> if it wouldn't be better to extend this and then use it within ObjectStoragePath
Airflow ObjectStorage is built on top of fsspec, so I guess you could extend some methods, like copy.
> With respect to the SDK, sounds good to me. However, the whole plugin is done pointing directly to the REST Endpoints
Small nit: this is an Airflow Provider, not an Airflow Plugin; those are slightly different things.
> I think it may be better in that sense to stick to one strategy (either change everything to point to the SDK or extend it using the REST API)
In the long run the SDK should replace internal solutions, which is why I propose using the SDK over direct calls to the API.
> In the long run the SDK should replace internal solutions, which is why I propose using the SDK over direct calls to the API.
Absolutely agree on the idea! That's quite a deep change, though, and I'm not sure how it would be handled or whether it should be part of a separate ticket (i.e., more of a refactor ticket than a feature one).
@Taragolis @eladkal should I move forward with this as originally posted or do you have sth different in mind?