`enable_mosaic` fails to add Mosaic to the PySpark session
Describe the bug
We use Databricks for running workflows, but GitHub Actions for some unit tests. When trying to use Mosaic outside of a Databricks notebook, the enable_mosaic(spark) step fails unless we add workarounds just for the tests.
This has already been reported in https://github.com/databrickslabs/mosaic/issues/115. I'm opening a new issue as that issue was closed, but the bug still exists as far as I can tell.
To Reproduce
Steps to reproduce the behavior:
- Set up a PySpark project; in my example I used the following versions:
[tool.poetry.dependencies]
python = "3.9.5"
databricks-mosaic = "0.3.9"
pyspark = "3.3.2"
- Try to enable Mosaic:
from pyspark.sql import SparkSession
from mosaic import enable_mosaic

spark = (
    SparkSession
    .builder
    .master("local[*]")
    .getOrCreate()
)

enable_mosaic(spark)
- This will result in an error and fail to attach Mosaic to the Spark session:
#> TypeError Traceback (most recent call last)
#> /Users/willbowditch/projects/mosaic-test/example.py in line 4
#> 2 from mosaic import enable_mosaic
#> 3 spark = SparkSession.builder.getOrCreate()
#> ----> 4 enable_mosaic(spark)
#>
#> File ~/projects/mosaic-test/.venv/lib/python3.9/site-packages/mosaic/api/enable.py:47, in enable_mosaic(spark, dbutils)
#> 14 """
#> 15 Enable Mosaic functions.
#> 16
#> (...)
#> 44
#> 45 """
#> 46 config.mosaic_spark = spark
#> ---> 47 _ = MosaicLibraryHandler(config.mosaic_spark)
#> 48 config.mosaic_context = MosaicContext(config.mosaic_spark)
#> 50 # Register SQL functions
#>
#> File ~/projects/mosaic-test/.venv/lib/python3.9/site-packages/mosaic/core/library_handler.py:29, in MosaicLibraryHandler.__init__(self, spark)
#> 25 raise FileNotFoundError(
#> 26 f"Mosaic JAR package {self._jar_filename} could not be located at {self.mosaic_library_location}."
#> 27 )
#> 28 LOGGER.info(f"Automatically attaching Mosaic JAR to cluster.")
#> ---> 29 self.auto_attach()
#>
#> File ~/projects/mosaic-test/.venv/lib/python3.9/site-packages/mosaic/core/library_handler.py:82, in MosaicLibraryHandler.auto_attach(self)
#> 77 converters = self.sc._jvm.scala.collection.JavaConverters
#> 79 JarURI = JavaURI.create("file:" + self._jar_path)
#> 80 lib = JavaJarId(
#> 81 JarURI,
#> ---> 82 ManagedLibraryId.defaultOrganization(),
#> 83 NoVersionModule.simpleString(),
#> 84 )
#> 85 libSeq = converters.asScalaBufferConverter((lib,)).asScala().toSeq()
#> 87 context = DatabricksILoop.getSharedDriverContextIfExists().get()
#>
#> TypeError: 'JavaPackage' object is not callable
Expected behavior
Mosaic should be activated successfully.
Possible cause
I think the problem is with the automatic activation code here: https://github.com/databrickslabs/mosaic/blob/74a55e96990417699271de30d52a5f2d6e0c2df9/python/mosaic/core/library_handler.py#L82-L85
I presume it is failing because that code relies on the internal com.databricks Scala libraries, which are not available locally. It is possible to work around this by finding the path of the JAR and disabling auto-attach, but it would be much easier to write unit tests if that weren't necessary.
Could this section be made to work both on and off Databricks?
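For completeness, the workaround looks roughly like this. This is only a sketch: the exact config keys (`spark.databricks.labs.mosaic.jar.autoattach` and the JAR path handling) are assumptions based on reading library_handler.py and may differ between Mosaic versions, and the JAR path shown is a placeholder for wherever the packaged JAR lives in your environment.

```python
from pyspark.sql import SparkSession
from mosaic import enable_mosaic

# Hypothetical location of the Mosaic JAR bundled with the pip package —
# adjust to your own environment.
MOSAIC_JAR = "/path/to/site-packages/mosaic/lib/mosaic-jar-with-dependencies.jar"

spark = (
    SparkSession
    .builder
    .master("local[*]")
    # Put the Mosaic JAR on the Spark classpath ourselves...
    .config("spark.jars", MOSAIC_JAR)
    # ...and (assumed config key) tell Mosaic not to attempt the
    # Databricks-only auto-attach code path that raises the TypeError above.
    .config("spark.databricks.labs.mosaic.jar.autoattach", "false")
    .getOrCreate()
)

enable_mosaic(spark)
```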
Hi @willbowditch,
Mosaic strongly depends on the Databricks Runtime, and this is the reason for the issue you are experiencing. There is a pattern for running integration tests on a connected DBR through GitHub Actions. I will create a documentation page with examples for that type of CI test.
I will provide a link to the examples here. I will also look into creating test APIs that are decoupled; however, that may not work for all APIs, it may require mock stubs, and it could take a while to produce test artifacts.
KR Milos
Thanks @milos-colic, that makes sense; I'll look into running the tests within the runtime.
Could you share any more details on which parts of Mosaic are coupled to the Databricks runtime and which are not?
My current project, which only uses a few Mosaic functions,
from mosaic import (
enable_mosaic,
grid_boundaryaswkb,
grid_tessellateexplode,
st_area,
st_aswkt,
st_geomfromwkt,
)
now works fine on standard Spark once the above enable_mosaic step is worked around. But it would be good to understand whether this issue is likely to come up again if we bring other parts of Mosaic into the codebase. Thanks
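For anyone arriving here from a failing CI run: a minimal pytest fixture for exercising these functions on plain Spark could look roughly like the sketch below. The JAR path and the `spark.databricks.labs.mosaic.jar.autoattach` key are assumptions (not confirmed against any particular Mosaic version), so treat this as a starting point rather than a recipe.

```python
import pytest
from pyspark.sql import SparkSession
from mosaic import enable_mosaic, st_area, st_geomfromwkt

@pytest.fixture(scope="session")
def spark():
    # Build a local session with the Mosaic JAR supplied manually and the
    # (assumed) Databricks-only auto-attach path disabled.
    session = (
        SparkSession
        .builder
        .master("local[*]")
        .config("spark.jars", "/path/to/mosaic-jar-with-dependencies.jar")  # hypothetical path
        .config("spark.databricks.labs.mosaic.jar.autoattach", "false")
        .getOrCreate()
    )
    enable_mosaic(session)
    yield session
    session.stop()

def test_st_area(spark):
    # A unit square should have area 1.0.
    df = spark.createDataFrame([("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",)], ["wkt"])
    row = df.select(st_area(st_geomfromwkt("wkt")).alias("area")).first()
    assert row.area == pytest.approx(1.0)
```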