[GH-1918] Spark 4 support

Open Kimahriman opened this issue 8 months ago • 6 comments

Did you read the Contributor Guide?

Is this PR related to a ticket?

  • Yes, and the PR name follows the format [GH-XXX] my subject.

Resolves #1918

What changes were proposed in this PR?

Add support for Spark 4.

This required several updates:

  • A new profile is added for Spark 4
  • The spark/common module has source directories for Spark 3 and Spark 4 respectively. This had to live in the common module because code there depends on the version-specific shims. The main breaking changes that required this are:
    • Column objects are no longer wrappers around Expression objects; they now wrap a new ColumnNode construct added for Spark Connect support. Supporting the expression wrapping requires a different setup per version. I initially tried to handle this through reflection, but that got pretty messy, and separate artifacts will be required anyway, so I added the conditional source directories.
    • Creating a DataFrame from an RDD has to use the new location of the "classic" DataFrame class.
    • The NullIntolerant trait no longer exists; instead, null intolerance is an overridable method on Expression.
  • jt-jiffle-language and its antlr dependency have to be shaded into the common module for Spark 4 to work. antlr 4.10 included an internal version bump such that code compiled with antlr < 4.10 cannot run against an antlr >= 4.10 runtime. I believe jt-jiffle-language is Apache licensed, so I think this is OK? Currently it's a provided dependency that comes with the external geotools-wrapper, but this needs some verification, or thoughts on an alternative approach.
  • I copied the spark-3.5 module as-is to spark-4.0. The only changes I had to make were to the new Arrow UDF support that was added recently. Could these also just be moved to conditional source directories in spark/common?
  • DBSCAN tests are ignored on Spark 4 because the current graphframes dependency does not support Spark 4. I've been messing around with getting graphframes updated as well.
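For readers following along, the Spark 4 profile plus conditional source directories described above might look roughly like this in a pom (the profile id, property names, and directory path are illustrative, not the actual Sedona build config):

```xml
<!-- Illustrative sketch: a Spark 4 profile that activates a
     version-specific source directory via build-helper-maven-plugin -->
<profile>
  <id>spark-4.0</id>
  <properties>
    <spark.version>4.0.0</spark.version>
  </properties>
  <build>
    <plugins>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>build-helper-maven-plugin</artifactId>
        <executions>
          <execution>
            <id>add-spark-version-sources</id>
            <phase>generate-sources</phase>
            <goals><goal>add-source</goal></goals>
            <configuration>
              <sources>
                <!-- shims compiled only when this profile is active -->
                <source>src/main/spark-4.0/scala</source>
              </sources>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>
```

With this shape, each profile points the same module at a different shim directory, which is what lets common code compile against both Spark lines.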

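As a sketch of the NullIntolerant change, the two source trees could expose a common shim along these lines (the shim trait name is hypothetical; the trait and method are from upstream Spark as described above):

```scala
// Spark 3.x source tree: null intolerance is a marker trait
import org.apache.spark.sql.catalyst.expressions.{Expression, NullIntolerant}
trait SedonaNullIntolerantShim extends Expression with NullIntolerant

// Spark 4.x source tree: the trait is gone; null intolerance is
// signalled by overriding a method on Expression instead
trait SedonaNullIntolerantShim extends Expression {
  override def nullIntolerant: Boolean = true
}
```

Expressions in common code then mix in the shim and pick up the right mechanism depending on which source directory was compiled.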
How was this patch tested?

Existing UTs.

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the documentation.

Maybe the supported versions need to change? I haven't looked at the docs yet.

Kimahriman avatar Apr 13 '25 00:04 Kimahriman

Working on getting the Python CI to pass; I realized I never updated that.

In the meantime, do we want to raise minimum versions for other things to reduce the testing matrix? Specifically, maybe dropping Spark 3.3 and Python 3.7 support?

Kimahriman avatar May 23 '25 15:05 Kimahriman

@Kimahriman

Fine by me. Feel free to create a PR to drop Spark 3.3 support (and code), and remove Python 3.7 from the test matrix.

jiayuasu avatar May 23 '25 15:05 jiayuasu

OK, got the Python tests working; it just required a few more updates:

  • Spark 4 requires pandas 2, so I had to upgrade the Pipfile dependency.
  • I also had to disable the DBSCAN Python tests.
  • I had to fix jiffle being double-shaded into the spark-shaded module.
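For the Pipfile change, the constraint would be along these lines (illustrative; the actual pinned range lives in the repo's Pipfile):

```toml
[packages]
# Spark 4's pandas-on-Spark and Arrow UDF paths require pandas 2.x
pandas = ">=2.0"
```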

Now that Spark 4 has been officially released, I think this is ready. The jiffle/antlr shading is the main outstanding question. I think the cleanest approach is to just shade it directly into sedona-common, but I'm not a shading or license expert by any means.
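For reference, shading jiffle's antlr runtime into sedona-common with the maven-shade-plugin would look something like the fragment below (the artifact coordinates and relocated package name are illustrative; double-check the actual jt-jiffle group id before using them):

```xml
<!-- Illustrative maven-shade-plugin fragment -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <artifactSet>
      <includes>
        <include>it.geosolutions.jaiext.jiffle:jt-jiffle-language</include>
        <include>org.antlr:antlr4-runtime</include>
      </includes>
    </artifactSet>
    <relocations>
      <!-- move jiffle's own antlr runtime under a shaded package so it
           cannot clash with the antlr >= 4.10 on Spark 4's classpath -->
      <relocation>
        <pattern>org.antlr.v4.runtime</pattern>
        <shadedPattern>org.apache.sedona.shaded.org.antlr.v4.runtime</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

The relocation is what matters here: both antlr runtimes can then coexist because jiffle's classes reference the shaded package name.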

Kimahriman avatar May 24 '25 11:05 Kimahriman

I'm on leave right now, so I won't be able to update off the latest master for a couple of weeks at least, if anyone else wants to get this finished up. The only outstanding thing before a new release with Spark 4 support is the graphframes DBSCAN issue. I started working on updating that as well, but haven't had time to finish, and I'm not sure how much progress the other people working on graphframes have made.

Kimahriman avatar May 30 '25 17:05 Kimahriman

I'm taking a look at graphframes for spark 4

james-willis avatar May 30 '25 21:05 james-willis

Should we merge this? @Kimahriman is getting close on graphframes 4.0 support, so DBSCAN will be unblocked soon as well.

james-willis avatar Jun 25 '25 07:06 james-willis

Yeah, I think we should merge this so it doesn't fall behind again; we can undo the DBSCAN test changes once a new graphframes release comes out, hopefully soonish 🤞

Kimahriman avatar Jun 25 '25 16:06 Kimahriman

Thank you for the hard work @Kimahriman !

jiayuasu avatar Jun 25 '25 17:06 jiayuasu

@james-willis I started testing out a local build of the graphframes updates and I'm actually getting some failing tests for DBSCAN. It looks like it has to do with https://github.com/graphframes/graphframes/pull/320, which preserves the original ID instead of always using the generated long ID, so the component ID is not always a long anymore. I'm not sure of the best way to address that given how the physical functions and such work. It would be great if you could take a look, since you set most of that up.

Kimahriman avatar Jul 02 '25 17:07 Kimahriman

OK, I'm talking to Sem about this now. I would like to avoid making a breaking change to our API if possible.

james-willis avatar Jul 02 '25 17:07 james-willis

@Kimahriman would such a solution (https://github.com/graphframes/graphframes/issues/620) be OK for you? TL;DR: setting conf("spark.graphframes.useLabelsAsComponents", "false") will preserve the generated long ID.

SemyonSinchenko avatar Jul 03 '25 05:07 SemyonSinchenko

That sounds fine to me. I think it should be doable to set a temporary SQL conf around the call to connected components.
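The "temporary SQL conf" pattern is simple to sketch. Here is a minimal, self-contained Python version using a plain dict as a stand-in for spark.conf (with a real SparkSession you would call spark.conf.set/get/unset the same way; the helper name is hypothetical):

```python
from contextlib import contextmanager


@contextmanager
def temp_conf(conf, key, value):
    """Set a conf key for the duration of a block, restoring the prior state afterwards."""
    had_key = key in conf
    old = conf.get(key)
    conf[key] = value
    try:
        yield
    finally:
        if had_key:
            conf[key] = old
        else:
            del conf[key]


# Stand-in for spark.conf; the key comes from the graphframes issue above.
conf = {}
with temp_conf(conf, "spark.graphframes.useLabelsAsComponents", "false"):
    # inside the block the override is visible to connected components
    assert conf["spark.graphframes.useLabelsAsComponents"] == "false"
# outside the block the key is restored to its previous (unset) state
assert "spark.graphframes.useLabelsAsComponents" not in conf
```

Wrapping the connected-components call this way keeps the override invisible to the rest of the user's session.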

Kimahriman avatar Jul 03 '25 14:07 Kimahriman

I will probably use the Sedona session configurator to default this to true in Sedona applications, as long as it isn't explicitly set. I'll also add support for returning strings in case this conf is set to true.

james-willis avatar Jul 03 '25 17:07 james-willis