`aql.dataframe` should support submitting tasks to databricks

Open dimberman opened this issue 2 years ago • 0 comments

Please describe the feature you'd like to see

Based on customer interviews, we have found that some customers would prefer to store their Python code inside their Airflow DAGs instead of in Databricks notebooks.

We should add a Databricks mode to aql.dataframe so that users can use the pandas API integration with Databricks to send their pandas code to Databricks.

example:

@aql.dataframe(dataframe_opts=DatabricksOptions(conn_id=...))
def foo(df: pd.DataFrame):
    ...

Describe the solution you'd like

To implement this feature, we need two things: the ability to generate Databricks Python files and the ability to send those files to Databricks. Submitting Python files to Databricks is already handled by the way we support Autoloader in the aql.load_file function.

For generating the Python file, I think we can take an approach similar to how Airflow handles the PythonVirtualenvOperator and its corresponding decorator.

The Python virtualenv decorator is essentially able to take an arbitrary Python function, render it into a Jinja template file, and load arbitrary args and kwargs through that template.
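
The generation step could look roughly like the sketch below. It is only an illustration of the idea, not the SDK's or Airflow's implementation: it uses the standard library's string.Template instead of Jinja to stay dependency-free, and SCRIPT_TEMPLATE and render_script are hypothetical names. Arguments are embedded as JSON, so this sketch only covers JSON-serializable values.

```python
import json
import textwrap
from string import Template

# Hypothetical template for the generated file. Airflow's virtualenv
# operator uses a Jinja file for the same job; string.Template is a
# stand-in here so the sketch has no third-party dependencies.
SCRIPT_TEMPLATE = Template(
    """\
import json

$function_source

if __name__ == "__main__":
    args = json.loads($args_json)
    kwargs = json.loads($kwargs_json)
    result = $function_name(*args, **kwargs)
    print(json.dumps(result))
"""
)


def render_script(function_source, function_name, args=(), kwargs=None):
    """Render a standalone Python script that calls one function with fixed arguments."""
    return SCRIPT_TEMPLATE.substitute(
        # Remove the indentation the function had inside its original module.
        function_source=textwrap.dedent(function_source).strip(),
        function_name=function_name,
        # repr() turns the JSON text into a valid Python string literal.
        args_json=repr(json.dumps(list(args))),
        kwargs_json=repr(json.dumps(kwargs or {})),
    )
```

For a function defined in a regular module, the source text could be obtained with inspect.getsource(func) (with the decorator line removed) and passed to render_script together with func.__name__. The resulting string is what would be written to a file before upload.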

Once the file is generated, we can upload it to DBFS and then run it the same way we do with load_file.

Are there any alternatives to this feature?

The alternative in this case is for users to continue referring to Databricks notebooks, which breaks Airflow's promise of idempotency (a user can change their notebook while their DAG stays the same).

Acceptance Criteria

  • [ ] All checks and tests in the CI should pass
  • [ ] Unit tests (90% code coverage or more, once available)
  • [ ] Integration tests (if the feature relates to a new database or external service)
  • [ ] Example DAG
  • [ ] Docstrings in reStructuredText for each method, class, function, and module-level attribute (including an Example DAG showing how it should be used)
  • [ ] Exception handling in case of errors
  • [ ] Logging (are we exposing useful information to the user? e.g. source and destination)
  • [ ] Improve the documentation (README, Sphinx, and any other relevant)
  • [ ] How-to guide for the feature (example)

dimberman, Mar 03 '23 02:03