kedro-plugins SparkHiveDataset is incompatible with Databricks Connect V2

Open alamastor opened this issue 1 year ago • 2 comments

Description

SparkHiveDataset.exists raises when called using a Databricks Connect V2 SparkSession.

Using kedro-plugins commit f59e930, i.e. an unreleased version, downstream of https://github.com/kedro-org/kedro-plugins/pull/352 (which adds support for DB Connect V2).

This occurs because DB Connect V2 doesn't support accessing _jsparkSession on the SparkSession, but SparkHiveDataset.exists relies on it.

The obvious solution is to replace _get_spark()._jsparkSession.catalog().tableExists(self._database, self._table) with _get_spark().catalog.tableExists(self._database, self._table), however there may be a reason _jsparkSession was used that I'm not aware of.
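A minimal sketch of that replacement, using stand-in classes so it runs without a Spark installation. One detail to check before raising a PR: PySpark documents the Python-side method as Catalog.tableExists(tableName, dbName=None), which is the reverse of the JVM argument order, so the two arguments would need to be swapped rather than passed as-is. The names table_exists, FakeCatalog and FakeSession are hypothetical and exist only for illustration.

```python
from typing import Optional, Set, Tuple


class FakeCatalog:
    """Hypothetical stand-in for spark.catalog.

    Mirrors the documented PySpark signature tableExists(tableName, dbName=None).
    """

    def __init__(self, tables: Set[Tuple[str, str]]) -> None:
        self._tables = tables  # set of (database, table) pairs

    def tableExists(self, tableName: str, dbName: Optional[str] = None) -> bool:
        return (dbName or "default", tableName) in self._tables


class FakeSession:
    """Hypothetical stand-in for a Spark Connect session (no _jsparkSession)."""

    def __init__(self, tables: Set[Tuple[str, str]]) -> None:
        self.catalog = FakeCatalog(tables)


def table_exists(spark, database: str, table: str) -> bool:
    # Uses only the Python-side catalog; never touches spark._jsparkSession.
    # Argument order: table name first, database name second.
    return bool(spark.catalog.tableExists(table, database))


spark = FakeSession({("sales", "orders")})
print(table_exists(spark, "sales", "orders"))   # True
print(table_exists(spark, "sales", "missing"))  # False
```

With a real session, the same table_exists function would be called with the object returned by _get_spark(); whether that behaves identically on every Databricks runtime is not verified here.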

I'm happy to raise a PR with this change.

Context

Use SparkHiveDataset with Databricks Connect V2.

Steps to Reproduce

Install kedro-plugins from master / a commit downstream of https://github.com/kedro-org/kedro-plugins/pull/352
Set up Databricks Connect per https://docs.databricks.com/en/dev-tools/databricks-connect/python/install.html
Use a SparkHiveDataset

Expected Result

The dataset doesn't raise when _exists is called (this works with Databricks Connect V1).

Actual Result

[JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jsparkSession` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session.

Dec 08 '23 00:12 alamastor

I've also encountered this. catalog.tableExists was only introduced in Spark 3.3, so making this change will break some backwards compatibility (the current constraint is pyspark>=2.2). The datasets themselves require Python 3.9, which means the effective lower bound is already pyspark>3. I'm in favour of upgrading.
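If dropping older PySpark versions turned out to be undesirable, one alternative would be to prefer the Python-side API when it is present and fall back to the JVM call otherwise. A rough sketch under that assumption follows; every class is a hypothetical stand-in for a real session, table_exists_compat is an illustrative name rather than an existing kedro function, and the RuntimeError only imitates the error Spark Connect raises.

```python
class JvmCatalog:
    """Stand-in for the JVM catalog: tableExists(dbName, tableName)."""

    def __init__(self, tables):
        self._tables = tables  # set of (database, table) pairs

    def tableExists(self, dbName, tableName):
        return (dbName, tableName) in self._tables


class JvmSession:
    """Stand-in for the object behind spark._jsparkSession."""

    def __init__(self, tables):
        self._catalog = JvmCatalog(tables)

    def catalog(self):
        return self._catalog


class OldPyCatalog:
    """Stand-in for the Python catalog on PySpark < 3.3 (no tableExists)."""


class OldSession:
    def __init__(self, tables):
        self.catalog = OldPyCatalog()
        self._jsparkSession = JvmSession(tables)


class NewPyCatalog:
    """Stand-in for the Python catalog on PySpark >= 3.3."""

    def __init__(self, tables):
        self._tables = tables

    def tableExists(self, tableName, dbName=None):
        return (dbName, tableName) in self._tables


class ConnectSession:
    """Stand-in for a Spark Connect session: JVM access is refused."""

    def __init__(self, tables):
        self.catalog = NewPyCatalog(tables)

    @property
    def _jsparkSession(self):
        raise RuntimeError("[JVM_ATTRIBUTE_NOT_SUPPORTED] stand-in error")


def table_exists_compat(spark, database, table):
    # Prefer the Python API; this branch never touches _jsparkSession.
    if hasattr(spark.catalog, "tableExists"):
        return bool(spark.catalog.tableExists(table, database))
    # Older PySpark: fall back to the JVM catalog (database first, then table).
    return bool(spark._jsparkSession.catalog().tableExists(database, table))


tables = {("sales", "orders")}
print(table_exists_compat(ConnectSession(tables), "sales", "orders"))  # True
print(table_exists_compat(OldSession(tables), "sales", "orders"))      # True
```

The trade-off is an extra code path to maintain and test, which is an argument for simply raising the lower bound instead.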

Dec 20 '23 21:12 sbrugman

We welcome PR contributions to fix this!

Jul 08 '24 14:07 merelcht
