[SEDONA-132] Move some functions to a common module

Open Kimahriman opened this issue 3 years ago • 14 comments

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

  • Yes, the URL of the associated JIRA ticket is https://issues.apache.org/jira/browse/SEDONA-132.

What changes were proposed in this PR?

Begin creating a new module with common SQL functions across Spark and Flink

How was this patch tested?

Existing UTs

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the docs.

Kimahriman avatar Jul 05 '22 22:07 Kimahriman

Started playing around with this and wanted to get some thoughts. I figured out some Scala magic to reduce the boilerplate needed for each Spark function. I only converted a few to get initial feedback. Maybe we can agree on an initial set to start with, others can slowly convert existing functions, and any new ones get added in the common module.
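For readers skimming the thread, here is a rough, hedged sketch of the pattern (not the actual code in this PR; the object and method names below are hypothetical): the geometry logic lives in the engine-agnostic common module, and the Spark and Flink modules keep only thin wrappers that delegate to it.

```scala
// Hedged sketch of engine-agnostic functions in the common module.
// Pure JTS, no Spark or Flink imports; names are illustrative only.
import org.locationtech.jts.geom.Geometry

object Functions {
  // e.g. the core of ST_Buffer...
  def buffer(geom: Geometry, radius: Double): Geometry = geom.buffer(radius)

  // ...or ST_Area.
  def area(geom: Geometry): Double = geom.getArea
}

// The Spark Catalyst expressions and Flink ScalarFunctions would then just
// deserialize their inputs and call Functions.buffer / Functions.area.
```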

Kimahriman avatar Jul 05 '22 22:07 Kimahriman

@Kimahriman This sounds like a good idea to me! But there are way more functions that need to move to this sedona-common, such as spatial partitioning, the serializer, the format readers... I believe this will take a lot of effort.

jiayuasu avatar Jul 09 '22 07:07 jiayuasu

@netanel246 @Imbruced @yitao-li Any opinions?

jiayuasu avatar Jul 09 '22 07:07 jiayuasu

@Kimahriman This sounds like a good idea to me! But there are way more functions that need to move to this sedona-common, such as spatial partitioning, the serializer, the format readers... I believe this will take a lot of effort.

Yeah, I just wanted to get the conversation started and figure out how to break this up so it isn't one massive effort covering everything in a single PR, but instead gets done piece by piece (and at least any new things start in the common module).

Kimahriman avatar Jul 09 '22 14:07 Kimahriman

Doing it piece by piece is probably a good idea but it would be nice to have a clear end goal.

Spatial partitioning and the kryo serializer are not intertwined with other APIs and should be pretty easy to move. I don't think it's strictly necessary to move the format readers. Some ST constructors depend on format readers; they could be refactored to call the JTS and GeoJSON libraries directly. Then the format readers could use the functions in common where it makes sense.
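As a hedged illustration of that last point (illustrative names only, not code from this PR), a constructor could parse its input with JTS directly rather than routing through the RDD format readers:

```scala
// Sketch: a constructor-style helper in the common module that parses WKT
// with JTS directly instead of going through the format readers.
import org.locationtech.jts.geom.Geometry
import org.locationtech.jts.io.WKTReader

object Constructors {
  def geomFromWKT(wkt: String): Geometry = new WKTReader().read(wkt)
}
```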

umartin avatar Jul 18 '22 13:07 umartin

Doing it piece by piece is probably a good idea but it would be nice to have a clear end goal.

Spatial partitioning and the kryo serializer are not intertwined with other APIs and should be pretty easy to move. I don't think it's strictly necessary to move the format readers. Some ST constructors depend on format readers; they could be refactored to call the JTS and GeoJSON libraries directly. Then the format readers could use the functions in common where it makes sense.

How about limiting this PR to the "Functions" defined in Flink? And then make separate issues for other things Flink already has (predicates, constructors), and then other things can be done as they are used/added to Flink

Kimahriman avatar Jul 18 '22 16:07 Kimahriman

Doing it piece by piece is probably a good idea but it would be nice to have a clear end goal. Spatial partitioning and the kryo serializer are not intertwined with other APIs and should be pretty easy to move. I don't think it's strictly necessary to move the format readers. Some ST constructors depend on format readers; they could be refactored to call the JTS and GeoJSON libraries directly. Then the format readers could use the functions in common where it makes sense.

How about limiting this PR to the "Functions" defined in Flink? And then make separate issues for other things Flink already has (predicates, constructors), and then other things can be done as they are used/added to Flink

I agree. Let's start with the functions in this PR. And then create other PRs for other functions.

jiayuasu avatar Jul 18 '22 20:07 jiayuasu

Ok, I think I got all the Flink Functions. Rewriting the geohash logic in Java was a bit of a pain; I wish everything could just be in Scala :sweat_smile: I didn't try to change any tests, was just gonna let the flink/spark-sql tests keep checking the logic for now.

Kimahriman avatar Jul 20 '22 16:07 Kimahriman

@Kimahriman Do you think we should release sedona-common as a separate module, or should it be automatically packaged into the other modules?

jiayuasu avatar Jul 23 '22 00:07 jiayuasu

So I was actually just playing around with that, now that I think I understand how the modules currently work. Basically every dependency is provided scope except for the python-adapter, which has basically everything bundled inside of it. Is there a reason it's set up that way? Any reason not to change it to what I think is the more traditional approach of just having compile scope dependencies, except for the big stuff like Spark, Hadoop, and Flink (and GeoTools for license reasons)?

Accidentally hit close on the merge request, hah.

Kimahriman avatar Jul 23 '22 03:07 Kimahriman

@Kimahriman Users were supposed to include Sedona modules individually. They can mix and match different modules, and no module has any compile scope dependencies. This makes it easy for Scala/Java experts to manage dependencies themselves.

However, when @Imbruced initially designed the python-adapter for Sedona PySpark, he suggested that we should provide a fat jar solution for Python users, because people in the Python world are not familiar with Maven packaging. Therefore, we decided to put all dependencies (including all Sedona modules, JTS, and GeoJSON, but excluding GeoTools) into sedona-python-adapter.

The python-adapter was supposed to be used only by Python users, but later we found that many Scala/Java/R users also struggle with packaging. So on the Sedona website, we recommend that newcomers use the python-adapter jar unless they are confident about managing the packaging themselves.

jiayuasu avatar Jul 23 '22 04:07 jiayuasu

Yeah, that's what I ended up doing all the time: I just include the python-adapter and it automatically pulls in all the dependencies I need, which is a little odd. In fact, the python-adapter both bundles things inside of it and has compile scope dependencies on those same things, so you end up including all of the classes twice.

From a Spark perspective (which is all I do, not Flink, and via Python, so I'm not a Java/Scala/Maven expert by any means), the two main approaches to including dependencies are --packages (or spark.jars.packages), which resolves and automatically includes compile scope dependencies, or building a fat jar if you're spark-submitting Java/Scala code. Either way works naturally with compile scope dependencies, versus users having to know which transitive dependencies (and which versions) they need to include manually. So if everything were just compile scope and you only wanted to do Sedona SQL things, you could do --packages org.apache.sedona:sedona-sql-3.0_2.12:1.2.0 and have everything you need without worrying about anything else.
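To make that concrete (a sketch only, reusing the coordinates from the comment above), with compile scope dependencies a Scala/Spark user could pull in everything through a single artifact:

```scala
// Sketch: let Spark resolve Sedona and its transitive (compile scope)
// dependencies from Maven coordinates, instead of hand-picking jars.
// The config must be set before the session is created.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.jars.packages", "org.apache.sedona:sedona-sql-3.0_2.12:1.2.0")
  .getOrCreate()
```

(The same coordinates work with the --packages flag on spark-submit or spark-shell, as mentioned above.)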

Would you be open to switching to that type of setup? Then the "common" module would just be another compile scope module that gets automatically included for whatever needs it.

Kimahriman avatar Jul 23 '22 13:07 Kimahriman

@Kimahriman Yes, I think it would be good to make sedona-common a compile scope dependency of whatever package needs it, so users don't need to include it manually.

As in this PR,

  • sedona-core pom: needs sedona-common as a compile scope dependency (see the pom sketch below).
  • sedona-sql pom: no update needed because it needs sedona-core (provided scope) for the spatial join and kryo serializer.
  • sedona-viz pom: no update needed because it needs sedona-core and sedona-sql (provided scope) for the spatial join and kryo serializer.
  • sedona-python-adapter pom: no update needed because it needs sedona-core and sedona-sql (compile scope).
  • sedona-flink: needs sedona-common as a compile scope dependency. It currently also needs sedona-core and sedona-sql (provided scope). Note that the sedona-core and sedona-sql dependencies will eventually be dropped as we move all common functions to sedona-common in the next few PRs.
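As a sketch of what the first bullet might look like (assuming the common module's artifactId is sedona-common; the actual coordinates in the PR may differ), the sedona-core pom would gain a dependency at the default compile scope:

```xml
<!-- Hypothetical sketch for sedona-core/pom.xml; the artifactId and version
     placeholder may differ from the actual PR. -->
<dependency>
    <groupId>org.apache.sedona</groupId>
    <artifactId>sedona-common</artifactId>
    <version>${project.version}</version>
</dependency>
```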

Can you update the PR? Then I will merge this PR.

Eventually, I plan to release this PR in Sedona 1.3.0, not the next release, 1.2.1. This PR is a rather big change to the project structure and is better not put in a maintenance release like 1.2.1.

jiayuasu avatar Jul 25 '22 03:07 jiayuasu

This broke the build for me because of the wrong/mismatched version in:
https://github.com/apache/incubator-sedona/blob/f2e61b85dc95235bffd5aa49a8db35ad0735f1a7/common/pom.xml#L25
It should be <version>1.3.0-incubating-SNAPSHOT</version>.

Elephantusparvus avatar Aug 16 '22 12:08 Elephantusparvus

This broke the build for me because of the wrong/mismatched version in:

https://github.com/apache/incubator-sedona/blob/f2e61b85dc95235bffd5aa49a8db35ad0735f1a7/common/pom.xml#L25

It should be <version>1.3.0-incubating-SNAPSHOT</version>.

Yep, this never got updated after the 1.2.1 release was cut; can you update it in your PR?

Kimahriman avatar Aug 16 '22 13:08 Kimahriman

This broke the build for me because of the wrong/mismatched version in: https://github.com/apache/incubator-sedona/blob/f2e61b85dc95235bffd5aa49a8db35ad0735f1a7/common/pom.xml#L25

It should be <version>1.3.0-incubating-SNAPSHOT</version>.

Yep, this never got updated after the 1.2.1 release was cut; can you update it in your PR?

Did so, see https://github.com/apache/incubator-sedona/pull/662/commits/231f98b5016ec7fc0f541593343c5202dace43f3

Elephantusparvus avatar Aug 16 '22 14:08 Elephantusparvus