
`enable_mosaic` fails to add Mosaic to the pySpark session

willbowditch opened this issue 2 years ago

Describe the bug: We use Databricks for running workflows, but GitHub Actions for some unit tests. When trying to use Mosaic outside of a Databricks notebook, the enable_mosaic(spark) step fails unless we add workarounds just for the tests.

This has already been reported in https://github.com/databrickslabs/mosaic/issues/115. I'm opening a new issue as that issue was closed, but the bug still exists as far as I can tell.

To Reproduce

Steps to reproduce the behavior:

  1. Set up a PySpark project. In my example I used the following versions:
[tool.poetry.dependencies]
python = "3.9.5"
databricks-mosaic = "0.3.9"
pyspark = "3.3.2"
  2. Try to enable Mosaic:
from pyspark.sql import SparkSession
from mosaic import enable_mosaic

spark = (
    SparkSession
    .builder
    .master("local[*]")
    .getOrCreate()
)

enable_mosaic(spark)
  3. This results in an error and fails to attach Mosaic to the Spark session:
#> TypeError                                 Traceback (most recent call last)
#> /Users/willbowditch/projects/mosaic-test/example.py in line 4
#>       2 from mosaic import enable_mosaic
#>       3 spark = SparkSession.builder.getOrCreate()
#> ----> 4 enable_mosaic(spark)
#> 
#> File ~/projects/mosaic-test/.venv/lib/python3.9/site-packages/mosaic/api/enable.py:47, in enable_mosaic(spark, dbutils)
#>      14 """
#>      15 Enable Mosaic functions.
#>      16 
#>    (...)
#>      44 
#>      45 """
#>      46 config.mosaic_spark = spark
#> ---> 47 _ = MosaicLibraryHandler(config.mosaic_spark)
#>      48 config.mosaic_context = MosaicContext(config.mosaic_spark)
#>      50 # Register SQL functions
#> 
#> File ~/projects/mosaic-test/.venv/lib/python3.9/site-packages/mosaic/core/library_handler.py:29, in MosaicLibraryHandler.__init__(self, spark)
#>      25     raise FileNotFoundError(
#>      26         f"Mosaic JAR package {self._jar_filename} could not be located at {self.mosaic_library_location}."
#>      27     )
#>      28 LOGGER.info(f"Automatically attaching Mosaic JAR to cluster.")
#> ---> 29 self.auto_attach()
#> 
#> File ~/projects/mosaic-test/.venv/lib/python3.9/site-packages/mosaic/core/library_handler.py:82, in MosaicLibraryHandler.auto_attach(self)
#>      77 converters = self.sc._jvm.scala.collection.JavaConverters
#>      79 JarURI = JavaURI.create("file:" + self._jar_path)
#>      80 lib = JavaJarId(
#>      81     JarURI,
#> ---> 82     ManagedLibraryId.defaultOrganization(),
#>      83     NoVersionModule.simpleString(),
#>      84 )
#>      85 libSeq = converters.asScalaBufferConverter((lib,)).asScala().toSeq()
#>      87 context = DatabricksILoop.getSharedDriverContextIfExists().get()
#> 
#> TypeError: 'JavaPackage' object is not callable

Expected behavior

Mosaic should be activated successfully.

Possible cause

I think the problem is with the automatic activation code here: https://github.com/databrickslabs/mosaic/blob/74a55e96990417699271de30d52a5f2d6e0c2df9/python/mosaic/core/library_handler.py#L82-L85

I presume it is failing because that code relies on internal com.databricks Scala libraries, which are not available locally. It is possible to work around this by finding the path of the JAR and disabling auto-attachment (as shown in the sketch below), but it would be much easier to write unit tests if that weren't necessary.
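
For reference, here is a rough sketch of the workaround I mean. The spark.databricks.labs.mosaic.jar.autoattach config key and the lib/ location of the bundled JAR are assumptions based on reading library_handler.py in 0.3.x, so they may need adjusting for other versions:

from pathlib import Path

import mosaic
from pyspark.sql import SparkSession

# Locate the Mosaic JAR shipped inside the Python package
# (assumed to live under mosaic/lib/ -- verify for your installed version).
mosaic_jar = next(Path(mosaic.__file__).parent.glob("lib/*.jar"))

spark = (
    SparkSession
    .builder
    .master("local[*]")
    # Make the JAR available to the local session ourselves...
    .config("spark.jars", str(mosaic_jar))
    # ...and ask Mosaic to skip the Databricks-only auto-attach code path
    # (config key assumed from library_handler.py, not documented behaviour).
    .config("spark.databricks.labs.mosaic.jar.autoattach", "false")
    .getOrCreate()
)

from mosaic import enable_mosaic

enable_mosaic(spark)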

Could this section be made to work both on and off Databricks?

willbowditch avatar Apr 18 '23 16:04 willbowditch

Hi @willbowditch,

Mosaic strongly depends on the Databricks Runtime, and this is the reason for the issue you are experiencing. There is a pattern for running integration tests on a connected DBR through GitHub Actions. I will create a documentation page with examples for that type of CI tests.

I will provide a link to examples here. I will also look into creating test APIs that are decoupled; however, that may not work for all APIs, it may require mock stubs, and it could take a while to produce the test artefacts.

KR Milos

milos-colic avatar Apr 19 '23 08:04 milos-colic

Thanks @milos-colic, that makes sense, I'll take a look into running the tests within the runtime.

Could you share any more details on which parts of Mosaic are coupled to the Databricks runtime and which are not?

My current project only uses a few Mosaic functions:

from mosaic import (
    enable_mosaic,
    grid_boundaryaswkb,
    grid_tessellateexplode,
    st_area,
    st_aswkt,
    st_geomfromwkt,
)

These now work fine on standard Spark once the above enable_mosaic step is fixed. But it would be good to understand whether this issue is likely to come up again if we bring other parts of Mosaic into the codebase. Thanks
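
For context, this is the kind of minimal local check I have in mind; a sketch assuming enable_mosaic(spark) has already succeeded (e.g. via the JAR workaround above) and that the geometry functions compose as Column expressions:

from pyspark.sql import functions as F
from mosaic import st_area, st_geomfromwkt

# A single-row DataFrame holding a unit-square polygon as WKT
df = spark.createDataFrame(
    [("POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))",)], ["wkt"]
)

# Parse the WKT and compute its area; the unit square should give 1.0
df.select(st_area(st_geomfromwkt(F.col("wkt"))).alias("area")).show()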

willbowditch avatar Apr 20 '23 08:04 willbowditch