
`kedro-datasets`: Improve dependency management for `spark` Datasets

Open · MigQ2 opened this issue on Sep 24, 2024 · 1 comment

Description

I want to discuss how we could improve the way dependencies are managed for `SparkDataSet` and similar datasets.

Context

Currently `kedro-datasets[spark]` installs `pyspark`, which is a >300 MB monster of a package and is probably unnecessary in most deployment setups.

When running on Databricks, Spark is already installed in the Databricks cluster runtime, and reinstalling vanilla `pyspark` only risks breaking something.

When working locally against Databricks, `databricks-connect` is probably preferred, and installing vanilla `pyspark` alongside it risks breaking it as well.
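For context, recent `databricks-connect` versions supply their own remote `SparkSession` entry point rather than relying on vanilla `pyspark`, which is why a stray `pyspark` install can shadow or conflict with it. A minimal sketch, assuming `databricks-connect` >= 13 and default authentication from the environment (the table name is a placeholder):

```python
# Sketch: databricks-connect (v13+) provides its own Spark entry point;
# installing vanilla pyspark alongside it can shadow or conflict with
# these modules.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
df = spark.read.table("samples.nyctaxi.trips")  # placeholder table name
```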

Also, installing plain `kedro-datasets` without the `[spark]` extra doesn't work either: `hdfs` doesn't get installed, but the dataset code imports it.
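For illustration, here is a minimal sketch of how that hard import could be guarded so the dataset only fails when HDFS access is actually requested. The guard is an assumption on my part, not the current `kedro-datasets` implementation, and the helper function is hypothetical:

```python
# Hypothetical sketch: make the hdfs dependency optional so merely
# importing the dataset module doesn't fail when hdfs isn't installed.
try:
    from hdfs import HdfsError, InsecureClient
except ImportError:  # hdfs not installed, e.g. plain `kedro-datasets`
    HdfsError = InsecureClient = None


def _get_hdfs_client(uri: str):
    """Return an HDFS client, failing only when HDFS is actually used."""
    if InsecureClient is None:
        raise ImportError(
            "The 'hdfs' package is required for HDFS paths. "
            "Install it with: pip install hdfs"
        )
    return InsecureClient(uri)
```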

Possible Implementation

Maybe create a `kedro-datasets[light-spark]` extra that installs lightweight dependencies like `hdfs` and `s3fs`, but not `pyspark` itself?
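As a rough sketch of what that could look like in `kedro-datasets`' `setup.py` (the extra names and version pins here are illustrative assumptions, not the actual ones):

```python
# Hypothetical extras layout for kedro-datasets' setup.py;
# names and version pins are illustrative assumptions.
SPARK_LIGHT = ["hdfs>=2.5.8,<3.0", "s3fs>=2021.4"]  # no pyspark
SPARK = SPARK_LIGHT + ["pyspark>=2.2,<4.0"]         # full extra

extras_require = {
    "light-spark": SPARK_LIGHT,  # for Databricks / databricks-connect users
    "spark": SPARK,              # unchanged behaviour for everyone else
}
```

With this layout, Databricks users could `pip install "kedro-datasets[light-spark]"` and rely on the Spark runtime already present on the cluster.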

Possible Alternatives

I think the best option would be to completely redesign the datasets to work with Databricks in a modern environment: proper support for Unity Catalog tables, Unity Catalog authentication via external tables, `databricks-connect` for local development, and other recent Databricks functionality.

I have seen `ManagedTableDataset`. Maybe building on top of it to support more Databricks functionality could be a good option.
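For reference, a minimal sketch of how `ManagedTableDataset` can be used today; the catalog, schema, and table names are placeholders, and the exact parameters should be checked against the `kedro-datasets` documentation:

```python
# Minimal sketch of ManagedTableDataset usage; catalog/schema/table
# names are placeholders, not real objects.
from kedro_datasets.databricks import ManagedTableDataset

dataset = ManagedTableDataset(
    catalog="main",          # Unity Catalog catalog (placeholder)
    database="my_schema",    # schema within the catalog (placeholder)
    table="my_table",        # managed table name (placeholder)
    write_mode="overwrite",
)
df = dataset.load()  # returns a Spark DataFrame by default
```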
