kedro-plugins
SparkHiveDataset is incompatible with Databricks Connect V2
Description
SparkHiveDataset.exists raises when called using a Databricks Connect V2 SparkSession.
Using kedro-plugins commit f59e930, i.e. an unreleased version, downstream of https://github.com/kedro-org/kedro-plugins/pull/352 (which adds support for DB Connect V2).
This occurs because DB Connect V2 doesn't support accessing _jsparkSession on the SparkSession, but SparkHiveDataset.exists relies on it.
The obvious solution is to replace _get_spark()._jsparkSession.catalog().tableExists(self._database, self._table) with _get_spark().catalog.tableExists(self._table, self._database) (the Python catalog API takes the table name first, then the database). However, there may be a reason _jsparkSession was used that I'm not aware of.
I'm happy to raise a PR with this change.
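A minimal sketch of the proposed change, assuming PySpark >= 3.3, where the public `Catalog.tableExists(tableName, dbName=None)` is available. Note the argument order: the Python API takes the table name first, whereas the JVM call takes the database first. The helper name `table_exists` and the stand-in session are hypothetical and only there so the snippet runs without a cluster; this is not the dataset's actual code.

```python
from types import SimpleNamespace


def table_exists(spark, database: str, table: str) -> bool:
    """Check table existence through the public PySpark catalog API.

    Avoids the JVM-backed `_jsparkSession` attribute entirely.
    Assumes PySpark >= 3.3; the table name is passed first.
    """
    return spark.catalog.tableExists(table, database)


# Hypothetical stand-in for a SparkSession, so the sketch runs locally.
known = {("default", "events")}
fake_spark = SimpleNamespace(
    catalog=SimpleNamespace(
        tableExists=lambda tableName, dbName=None: (dbName, tableName) in known
    )
)

print(table_exists(fake_spark, "default", "events"))   # True
print(table_exists(fake_spark, "default", "missing"))  # False
```

With a real session, `spark` would be whatever `_get_spark()` returns.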
Context
Use SparkHiveDataset with Databricks Connect V2.
Steps to Reproduce
- Install kedro-plugins from master / a commit downstream of https://github.com/kedro-org/kedro-plugins/pull/352
- Set up Databricks Connect per https://docs.databricks.com/en/dev-tools/databricks-connect/python/install.html
- Use a SparkHiveDataset
Expected Result
The dataset doesn't raise when calling _exists (this works with Databricks Connect V1).
Actual Result
[JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jsparkSession` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session.
I've also encountered this. catalog.tableExists was only introduced in Spark 3.3, so making this change will break some backwards compatibility (the current constraint is pyspark>=2.2). The datasets themselves require Python 3.9, which means the effective lower bound is already pyspark>3. I'm in favour of upgrading.
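If support for PySpark versions older than 3.3 had to be kept rather than raising the lower bound, one possible fallback is sketched below. It assumes `Catalog.listTables(dbName)` is available on the older versions and returns objects with a `name` attribute. The helper `table_exists_compat` and the stand-in sessions are hypothetical, and the name comparison here is a plain string match with no case normalisation.

```python
from types import SimpleNamespace


def table_exists_compat(spark, database: str, table: str) -> bool:
    """Prefer Catalog.tableExists; fall back to listing tables.

    The fallback is an assumption for PySpark < 3.3, where
    Catalog.tableExists is not available on the Python side.
    """
    catalog = spark.catalog
    if hasattr(catalog, "tableExists"):
        # PySpark >= 3.3: table name first, then database.
        return catalog.tableExists(table, database)
    # Older PySpark: scan the tables registered in the database.
    return any(t.name == table for t in catalog.listTables(database))


# Hypothetical stand-ins for an old and a new SparkSession.
old_spark = SimpleNamespace(
    catalog=SimpleNamespace(
        listTables=lambda dbName=None: [SimpleNamespace(name="events")]
    )
)
new_spark = SimpleNamespace(
    catalog=SimpleNamespace(
        tableExists=lambda tableName, dbName=None: tableName == "events"
    )
)

print(table_exists_compat(old_spark, "default", "events"))  # True
print(table_exists_compat(new_spark, "default", "other"))   # False
```

Raising the PySpark lower bound, as suggested above, avoids carrying this branch at all.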
We welcome PR contributions to fix this!