Possibility for spark init scripts
I'm not sure whether this is the right place, but:
We are using clusters with our own containers to bundle our jars and Python libs. One issue we are facing, though: we are using https://sedona.apache.org/ to support geometry datatypes.
In notebooks or jobs this is fine, because we can run the required registration of datatypes and SQL functions ourselves:
SedonaRegistrator.registerAll(spark)
Is there a way to include such a code snippet as part of cluster startup, e.g. for use on clusters that only provide the SQL endpoint? As I understand it, the existing cluster init scripts run before any Spark context exists, so this can't be placed there.
This is usually done via spark.sql.extensions, which registers the necessary pieces during Spark initialization. For Sedona, this will be available in the next version; see SEDONA-21 for more details.
The only thing to keep in mind is that the library providing these extensions must be available when the cluster starts, either packaged into the docker image or installed via an init script.
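For illustration, a minimal sketch (Scala) of what the extensions mechanism does, assuming the Sedona jars are already on the cluster's classpath; the session-builder form is only to show the effect, on Databricks the same properties would go into the cluster's Spark config instead:

// Minimal sketch: with these properties set, Spark wires in Sedona's SQL extensions
// while the session is constructed, so no per-notebook
// SedonaRegistrator.registerAll(spark) call is needed.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.extensions", "org.apache.sedona.sql.SedonaSqlExtensions")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryo.registrator", "org.apache.sedona.core.serde.SedonaKryoRegistrator")
  .getOrCreate()

// Sedona SQL functions such as ST_Point should then resolve directly.
spark.sql("SELECT ST_Point(1.0, 2.0) AS pt").show()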
Thanks - that looks good. We'll wait for the next release.
I found this issue while trying to auto-register Sedona on our Databricks cluster. I've configured the following:
spark.sql.extensions org.apache.sedona.sql.SedonaSqlExtensions
spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator
spark.serializer org.apache.spark.serializer.KryoSerializer
but I cannot use Sedona in my notebooks unless I run the usual SedonaRegistrator.registerAll(spark).
From what I read in this issue, org.apache.sedona.sql.SedonaSqlExtensions should do it automatically, right?
I'm using Sedona 1.3.1-incubating.
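As an illustrative check (the specific query is just a representative Sedona SQL call, not taken verbatim from my notebooks), something like this only succeeds after the manual registration:

// `spark` is the SparkSession provided by the Databricks notebook.
import org.apache.sedona.sql.utils.SedonaRegistrator

// With only the cluster config set, Sedona's SQL functions are not resolved for me:
spark.sql("SELECT ST_Point(1.0, 2.0) AS pt").show()

// Current workaround, run once per notebook/session:
SedonaRegistrator.registerAll(spark)
spark.sql("SELECT ST_Point(1.0, 2.0) AS pt").show()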