sedona icon indicating copy to clipboard operation
sedona copied to clipboard

[GH-2004] Geopandas.GeoSeries: Implement Test Framework

Open petern48 opened this issue 6 months ago • 6 comments

Did you read the Contributor Guide?

Is this PR related to a ticket?

  • Yes, and the PR name follows the format [GH-XXX] my subject.

What changes were proposed in this PR?

Implement GeoSeries test skeleton. The goal here is to fully flush out the common testing code we will use for testing GeoSeries methods. Once we have this merged in, I can rapid-fire GeoSeries method implementations using the same testing structure, occasionally adding function specific tests when needed.

This PR also fixes a bug in the __repr__() method and changes the return type of .area() to pd.Series to be consistent with the Geopandas behavior.

How was this patch tested?

Add tests for existing functionality

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the documentation.

petern48 avatar Jun 23 '25 18:06 petern48

@zhangfengcdt PR definitely isn't fully ready yet, but I want to hear your thoughts. What types of common tests should we include for each function? Originally, I was thinking we have type checking tests (done) and tests for comparing our output to original geopandas function output, but the result of .buffer() is too far off for to match geopandas output successfully. Should we drop that type of test entirely or keep that type of test for the methods that can pass it? I guess we have to add tests for manually checking the output. Any other tests we should add?

As I mentioned in the PR description, I want this PR to flush out all the common tests we use for every method.

petern48 avatar Jun 23 '25 18:06 petern48

@zhangfengcdt PR definitely isn't fully ready yet, but I want to hear your thoughts. What types of common tests should we include for each function? Originally, I was thinking we have type checking tests (done) and tests for comparing our output to original geopandas function output, but the result of .buffer() is too far off for to match geopandas output successfully. Should we drop that type of test entirely or keep that type of test for the methods that can pass it? I guess we have to add tests for manually checking the output. Any other tests we should add?

As I mentioned in the PR description, I want this PR to flush out all the common tests we use for every method.

I think we need both basic tests that manually check the results from the API and the match to geopandas test suits. For the later, we might get some idea from the pandas on spark package, especially how assertPandasOnSparkEqual works in pyspark source.

For the former, Sedona has a test suit for expression (functions) and we may want to at least cover the cases in there: https://github.com/apache/sedona/blob/master/spark/common/src/test/scala/org/apache/sedona/sql/functionTestScala.scala

zhangfengcdt avatar Jun 23 '25 19:06 zhangfengcdt

For various reasons, none of the built in assert methods work out of the box, even when using parameters like checkExact and check_less_precise. I did manage to get it to pass by looping through, which we already do elsewhere in our code, and tuning the tolerance a bit. Personally, I think it's good enough for now.

Another reason I'd like to avoid using the pyspark testing functions (e.g assertPandasOnSparkEqual and assertDataFrameEqual) because they've been removed and added across different version. They're not available until 3.5.0, and we'd have to start using annoying conditional logic like if pyspark.__version__ >= 4.0.0 use this function, else if use this one, else skip, etc. Cleaner and easier to maintain if we just avoid using them all together.

petern48 avatar Jun 24 '25 17:06 petern48

I moved old test code to test_match_geopandas_series.py and created a new test_geoseries.py for micmic the scala tests you mentioned. I guess for these ST_AREA and ST_BUFFER, functionTestScala.scala doesn't test for exact output. Is this still fine @zhangfengcdt?

petern48 avatar Jun 24 '25 20:06 petern48

I moved old test code to test_match_geopandas_series.py and created a new test_geoseries.py for micmic the scala tests you mentioned. I guess for these ST_AREA and ST_BUFFER, functionTestScala.scala doesn't test for exact output. Is this still fine @zhangfengcdt?

Looks great! I think for the first step, we could target covering whatever these scala tests cover in exact results.

zhangfengcdt avatar Jun 24 '25 23:06 zhangfengcdt

Looks great! I think for the first step, we could target covering whatever these scala tests cover in exact results.

That made perfect sense originally, but now I see why it wasn't done in Scala. These test files we're using (self.mixedWktGeometryInputLocation) have 100 entries. Hard coding 100 polygons for a single test (e.g for ST_buffer) would make reading and navigating the test file a real pain.

Another thought. Originally, I had a test checking if the sedona sql function's result matched our new sgpd result. However, we agreed to remove it because it seemed trivial that of course the results would match. Following that same logic, if we can assume the new sedona geopandas results match the sedona sql results, then why do we need to replicate the Scala tests again if we already know sedona sql passes it? Or maybe we should just not make that assumption (in that case maybe it is worth bringing back that old sgpd vs sedona test). Personally, I'm fine with making the assumption. If anything, maybe we should hard-code results for the test_match_geopandas_series.py tests since those results are smaller. WDYT @zhangfengcdt?

petern48 avatar Jun 25 '25 05:06 petern48

Looks great! I think for the first step, we could target covering whatever these scala tests cover in exact results.

That made perfect sense originally, but now I see why it wasn't done in Scala. These test files we're using (self.mixedWktGeometryInputLocation) have 100 entries. Hard coding 100 polygons for a single test (e.g for ST_buffer) would make reading and navigating the test file a real pain.

Another thought. Originally, I had a test checking if the sedona sql function's result matched our new sgpd result. However, we agreed to remove it because it seemed trivial that of course the results would match. Following that same logic, if we can assume the new sedona geopandas results match the sedona sql results, then why do we need to replicate the Scala tests again if we already know sedona sql passes it? Or maybe we should just not make that assumption (in that case maybe it is worth bringing back that old sgpd vs sedona test). Personally, I'm fine with making the assumption. If anything, maybe we should hard-code results for the test_match_geopandas_series.py tests since those results are smaller. WDYT @zhangfengcdt?

I think the reason why we still need the hard-coded tests are that they are not simply replicating the scala expression tests, they should extend the scala cases. Basically, the goals are:

  • Verify the SQL created to convert geopandas -> sedona queries are expected;
  • Verify manually the results for matching or differences (reason) between geopandas origin and geopandas on Sedona.

This means we will very possibly cover more cases than the scala expression tests.

zhangfengcdt avatar Jun 25 '25 17:06 zhangfengcdt

Yes that makes sense, but the hard-coded outputs from the original scala tests would clutter up the codebase too much due to their sheer size (possibly like 1000+ lines for some test cases). IMO, it's not reasonable to hard code those outputs. Then at that point, if we're not extending the scala tests, is it worth still replicate those tests exactly in in python (test_geoseries.py)?

How do you feel about hard-coding the geopandas match test cases instead? Those ones are smaller, so the expected results won't be too bad.

petern48 avatar Jun 25 '25 17:06 petern48

Yes that makes sense, but the hard-coded outputs from the original scala tests would clutter up the codebase too much due to their sheer size (possibly like 1000+ lines for some test cases). IMO, it's not reasonable to hard code those outputs. Then at that point, if we're not extending the scala tests, is it worth still replicate those tests exactly in in python (test_geoseries.py)?

How do you feel about hard-coding the geopandas match test cases instead? Those ones are smaller, so the expected results won't be too bad.

Oh, I meant we can replicate the types of checks from that scala test, not the exact data, we can use smaller inputs for sure.

zhangfengcdt avatar Jun 25 '25 17:06 zhangfengcdt

How's this

petern48 avatar Jun 25 '25 21:06 petern48