SparkHiveDataset is incompatible with Databricks Connect V2

Open alamastor opened this issue 1 year ago • 2 comments

Description

SparkHiveDataset.exists raises when called using a Databricks Connect V2 SparkSession.

Using kedro-plugins commit f59e930, i.e. an unreleased version, downstream of https://github.com/kedro-org/kedro-plugins/pull/352 (which adds support for DB Connect V2).

This occurs because DB Connect V2 doesn't support accessing _jsparkSession on the SparkSession, but SparkHiveDataset.exists uses it.

The obvious solution is to replace _get_spark()._jsparkSession.catalog().tableExists(self._database, self._table) with _get_spark().catalog.tableExists(self._database, self._table). However, there may be a reason _jsparkSession was used that I'm not aware of.
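A minimal sketch of that replacement, assuming PySpark 3.3 or later. The helper name table_exists and the stand-in session are hypothetical and exist only so the snippet runs without a cluster. One detail worth checking before raising a PR: as far as I can tell, the Python Catalog.tableExists takes the table name first and the database second, the reverse of the JVM call, so the arguments are swapped here.

```python
from types import SimpleNamespace


def table_exists(spark, database: str, table: str) -> bool:
    """Check whether a table exists using only the public Python catalog API.

    No JVM attribute (such as _jsparkSession) is touched, so this should also
    be usable with a Spark Connect / Databricks Connect V2 session.
    """
    # Python signature: Catalog.tableExists(tableName, dbName=None)
    return bool(spark.catalog.tableExists(table, database))


class _FakeCatalog:
    """Hypothetical stand-in for spark.catalog, for illustration only."""

    def __init__(self, tables):
        self._tables = set(tables)

    def tableExists(self, tableName, dbName=None):
        return (dbName, tableName) in self._tables


# Hypothetical stand-in for a SparkSession holding one table, default.events.
fake_spark = SimpleNamespace(catalog=_FakeCatalog({("default", "events")}))

print(table_exists(fake_spark, "default", "events"))
print(table_exists(fake_spark, "default", "missing"))
```

With a real session, the same helper would be called as table_exists(spark, self._database, self._table) from inside the dataset.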

I'm happy to raise a PR with this change.

Context

Use SparkHiveDataset with Databricks connect V2.

Steps to Reproduce

  1. Install kedro-plugins from master / a commit downstream of https://github.com/kedro-org/kedro-plugins/pull/352
  2. Set up Databricks Connect per https://docs.databricks.com/en/dev-tools/databricks-connect/python/install.html
  3. Use a SparkHiveDataset

Expected Result

The dataset doesn't raise when calling _exists (this works with Databricks Connect V1).

Actual Result

[JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jsparkSession` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session.

alamastor, Dec 08 '23 00:12

I've also encountered this. catalog.tableExists was only introduced in Spark 3.3, so making this change will break some backwards compatibility (the current constraint is pyspark>=2.2). The datasets themselves require Python 3.9, which means the effective lower bound is already pyspark>3. I'm in favour of upgrading.
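If raising the lower bound turns out to be undesirable, one possible alternative is a fallback for older versions. This is only a sketch under assumptions: table_exists_compat and the stand-in catalogs are hypothetical names, and the fallback relies on Catalog.listTables, which predates tableExists in the Python API.

```python
from types import SimpleNamespace


def table_exists_compat(spark, database: str, table: str) -> bool:
    """Prefer catalog.tableExists, falling back to listTables on older PySpark."""
    catalog = spark.catalog
    if hasattr(catalog, "tableExists"):
        # PySpark >= 3.3: Catalog.tableExists(tableName, dbName=None)
        return bool(catalog.tableExists(table, database))
    # Older PySpark: scan the tables registered in the database instead.
    return any(t.name == table for t in catalog.listTables(database))


class _OldCatalog:
    """Hypothetical stand-in for a pre-3.3 catalog without tableExists."""

    def listTables(self, dbName=None):
        return [SimpleNamespace(name="events")]


class _NewCatalog(_OldCatalog):
    """Hypothetical stand-in for a 3.3+ catalog that offers tableExists."""

    def tableExists(self, tableName, dbName=None):
        return tableName == "events"


old_spark = SimpleNamespace(catalog=_OldCatalog())
new_spark = SimpleNamespace(catalog=_NewCatalog())

print(table_exists_compat(old_spark, "default", "events"))
print(table_exists_compat(new_spark, "default", "missing"))
```

The trade-off is an extra code path to maintain, which is an argument for simply upgrading the constraint instead.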

sbrugman, Dec 20 '23 21:12

We welcome PR contributions to fix this!

merelcht, Jul 08 '24 14:07