Save / Load indexed spatial & partitioned Rdd
Expected behavior
Maybe this is already possible, but I haven't found it anywhere. I'm relatively new to Sedona and geo-processing. I'd like to be able to save and later load a spatial RDD that has already been analyzed and partitioned, possibly together with its index. In our use case, many jobs use the same spatial dataset, and rebuilding the partitioning and the index every time is time-consuming. I'm not sure whether this is possible, though.
For example:
// save once:
val spatialRdd = Adapter.toSpatialRdd(df, ...)
spatialRdd.analyze()
// Cap the partition count at half the record count; otherwise Sedona throws
// IllegalArgumentException: [Sedona] Number of partitions cannot be larger than half of total records num
spatialRdd.spatialPartitioning(GridType.KDBTREE, math.min(Integer.MAX_VALUE, df.count() / 2).toInt)
spatialRdd.buildIndex(IndexType.RTREE, true)
SomeSedonaUtility.saveSpatialRdd(spatialRdd, path) // <-- requested (not existing) API: save together with the index and the partitioning
// load & use multiple times:
val rdd = SomeSedonaUtility.loadSpatialRdd(path)
// and usage:
val otherRdd = Adapter.toSpatialRdd(otherDs, ...)
otherRdd.spatialPartitioning(rdd.getPartitioner)
val useIndex = true
val spatialPredicate = SpatialPredicate.COVERS
val params = new JoinQuery.JoinParams(useIndex, spatialPredicate, IndexType.RTREE, JoinBuildSide.LEFT)
val joined = JoinQuery.spatialJoin(rdd, otherRdd, params)
Actual behavior
To my knowledge, the index and the partitioning must be built at runtime in every job.
Steps to reproduce the problem
The feature is missing, so it's not possible to reproduce it.
Settings
Sedona version = 1.5.1
Apache Spark version = 3.5
API type = Scala
Scala version = 2.12
JRE version = 1.8
Environment = EMR
@vbmacher Unfortunately, a spatially partitioned RDD cannot be saved and loaded back, because doing so leads to wrong results. See the explanation here: https://sedona.apache.org/1.5.1/tutorial/rdd/#save-an-spatialrdd-spatialpartitioned-wo-indexed
Thanks @jiayuasu. I also read there that it is possible to save an indexed RDD (https://sedona.apache.org/1.5.1/tutorial/rdd/#save-an-spatialrdd-indexed), but to my knowledge building an index requires spatial partitioning. So if I save the indexed RDD and then load it back, the partitioning won't be set up, but will the index still work?
Also, I'd like to know more details about this statement, if possible:
"We are working on some solutions. Stay tuned!"
Is this something we can expect in, say, the next release? Thanks!
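For reference, the indexed save/load pattern from the linked tutorial section can be sketched roughly as follows. This is a minimal sketch, not a confirmed answer to the question above. It assumes the index is built on the raw RDD (`buildIndex(..., false)`), so no spatial partitioning is involved. The object name `IndexedRddStorage` and the helpers `saveIndexed` and `loadIndexed` are hypothetical names made up for illustration and are not part of Sedona.

```scala
import org.apache.sedona.core.enums.IndexType
import org.apache.sedona.core.spatialRDD.SpatialRDD
import org.apache.spark.SparkContext
import org.locationtech.jts.geom.Geometry
import org.locationtech.jts.index.SpatialIndex

// Hypothetical helper object; only the calls inside it are Sedona/Spark APIs.
object IndexedRddStorage {

  // Build an R-tree over the raw (not spatially partitioned) RDD and
  // store the per-partition index objects as a distributed object file.
  def saveIndexed(rdd: SpatialRDD[Geometry], path: String): Unit = {
    rdd.buildIndex(IndexType.RTREE, false) // false = index rawSpatialRDD
    rdd.indexedRawRDD.saveAsObjectFile(path)
  }

  // Read the index objects back into a fresh SpatialRDD.
  // Only indexedRawRDD is populated; no partitioner is restored.
  def loadIndexed(sc: SparkContext, path: String): SpatialRDD[Geometry] = {
    val restored = new SpatialRDD[Geometry]
    restored.indexedRawRDD = sc.objectFile[SpatialIndex](path).toJavaRDD()
    restored
  }
}
```

An RDD restored this way carries only the index over the raw data, which is the form used for indexed range queries. For a join, the spatial partitioning (and any index built on top of it) would presumably still have to be rebuilt at runtime, in line with the maintainer's reply above.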