
Possibility for spark init scripts

Open sebbegg opened this issue 3 years ago • 3 comments

I'm not sure whether this is the right place, but:

We are using clusters with our own containers to bundle our JARs and Python libs. One issue we are facing, though: we use https://sedona.apache.org/ to support geometry data types.

In notebooks or jobs this is fine, because we can run the required registration of datatypes and SQL functions ourselves:

SedonaRegistrator.registerAll(spark)

Is there a way to run such a code snippet as part of cluster startup, e.g. on clusters that only provide the SQL endpoint? As I understand it, the existing cluster init scripts run before any Spark context exists, so the call can't be placed there.

sebbegg avatar Apr 27 '21 07:04 sebbegg

This is usually done via spark.sql.extensions, which registers the necessary things during Spark initialization. For Sedona, this will be available in the next version; see SEDONA-21 for more details.

The only thing to take into account: the library providing these extensions must be available when the cluster starts, either packaged into the Docker image or installed via an init script.
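As a rough illustration of the mechanism described above, here is a minimal Python sketch for the case where you create the Spark session yourself. The keys and class names are the ones quoted in the configuration later in this thread. The helper name `with_sedona_conf` is hypothetical. Because spark.sql.extensions is read when the session is created, on a managed cluster the same key/value pairs would go into the cluster's Spark config rather than into notebook code.

```python
# Keys and class names as quoted later in this thread; adjust them to
# whatever your Sedona version documents.
SEDONA_SPARK_CONF = {
    "spark.sql.extensions": "org.apache.sedona.sql.SedonaSqlExtensions",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.sedona.core.serde.SedonaKryoRegistrator",
}


def with_sedona_conf(builder, conf=SEDONA_SPARK_CONF):
    """Apply each setting to a SparkSession.builder-like object and return it.

    Hypothetical helper: it only relies on builder.config(key, value)
    returning a builder, as SparkSession.builder does.
    """
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Usage (assumes pyspark is installed and the Sedona jars are on the classpath):
# from pyspark.sql import SparkSession
# spark = with_sedona_conf(SparkSession.builder).getOrCreate()
```

The settings only take effect if they are applied before the session exists; calling this against an already running session will not load the extension.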

alexott avatar May 06 '21 13:05 alexott

Thanks - that looks good. We'll wait for the next release.

sebbegg avatar May 07 '21 04:05 sebbegg

I found this issue while trying to autoregister Sedona from our Databricks cluster. I've configured the following:

spark.sql.extensions org.apache.sedona.sql.SedonaSqlExtensions
spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer

but I cannot use Sedona in my notebooks unless I first run the usual SedonaRegistrator.registerAll(spark).

From what I read in this issue, org.apache.sedona.sql.SedonaSqlExtensions should do it automatically, right? I'm using Sedona 1.3.1-incubating.
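One way to narrow down a situation like this is to check, from a notebook, whether the running session actually picked up the extension setting and whether a Sedona SQL function resolves without the manual registration call. The following is a hedged sketch, not a confirmed diagnosis: `sedona_status` is a hypothetical helper, and the probe query assumes `ST_Point` is one of the functions the extension is expected to register.

```python
def sedona_status(spark, probe="SELECT ST_Point(1.0, 2.0)"):
    """Report the configured SQL extensions and whether a probe query resolves.

    `spark` is expected to behave like a SparkSession: it needs
    spark.conf.get(key, default) and spark.sql(query).collect().
    """
    extensions = spark.conf.get("spark.sql.extensions", "")
    try:
        # An unregistered function normally fails during analysis,
        # collect() just forces execution in case it does not.
        spark.sql(probe).collect()
        functions_ok = True
    except Exception:
        functions_ok = False
    return {"extensions": extensions, "functions_registered": functions_ok}


# Usage in a notebook where `spark` already exists:
# print(sedona_status(spark))
```

If `extensions` comes back empty, the setting never reached the session; if it is set but the probe fails, the extension class was probably not on the classpath at startup or does not register the functions in that version.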

giohappy avatar Feb 06 '23 11:02 giohappy