sedona icon indicating copy to clipboard operation
sedona copied to clipboard

Reconciling rasterstats.zonal_stats with RS_ZonalStats

Open ticrz opened this issue 7 months ago • 3 comments

Hi team,

First of all, thank you for the great work on Sedona!

I am currently attempting to replace some legacy rasterstats scripts with a Sedona based SQL approach. However, I'm facing challenges reconciling the outputs from Sedona's RS_ZonalStats function with those from the original Python rasterstats.zonal_stats.

My goal is to perform zonal statistics on a raster dataset, aggregating values based on polygon geometries. To debug the discrepancy, I've experimented specifically with pixel counts and found the following:

Using RS_PixelAsPolygons + ST_Intersects, I am able to almost match (depending on the shapes) results from Python's rasterstats.zonal_stats.

However, using RS_ZonalStats, RS_Clip, or RS_AsRaster, the results do not match, and the reason for the differences isn't clear to me.

You will find below 2 Jupyter notebooks demonstrating my experimentations : Loading shapes (two US counties) Loading a raster (.tif file with surface temperature data) Comparison of pixel count results from different methods

Could someone please help clarify why these discrepancies occur or suggest best practices for using RS_ZonalStats in a way consistent with the traditional rasterstats results?

Thanks very much in advance for your help!

Using rasterstats

! pip install geopandas==1.0.1 rasterstats==0.20.0

! wget https://www2.census.gov/geo/tiger/GENZ2023/shp/cb_2023_us_county_20m.zip
! unzip cb_2023_us_county_20m.zip

! wget https://geodata.ucdavis.edu/climate/worldclim/2_1/hist/cts4.06/10m/wc2.1_cruts4.06_10m_tmin_2020-2021.zip
! unzip wc2.1_cruts4.06_10m_tmin_2020-2021.zip

import geopandas as gpd
from rasterstats import zonal_stats

shapefile_path = "cb_2023_us_county_20m.shp"
raster_path = "wc2.1_10m_tmin_2021-12.tif"
# Load the shapefile into a GeoDataFrame
gdf = gpd.read_file(shapefile_path)

# Perform zonal statistics to count raster pixels within each geometry
# all_touched=True includes all pixels touched by geometries
stats = zonal_stats(
    vectors=gdf.geometry,
    raster=raster_path,
    stats=["count"],
    all_touched=True,
    raster_out=False
)

# Extract pixel counts and add as a new column
gdf["pixel_count"] = [s["count"] for s in stats]

# 2px for Alexandria and 2810px for Yukon-Koyukuk
gdf[(gdf["NAME"]=="Yukon-Koyukuk") | (gdf["NAME"]=="Alexandria")]

Using sedona

! pip install apache-sedona[spark]==1.7.1

from sedona.spark import *
import pyspark.sql.functions as F

config = (
        SedonaContext.builder()
        .config(
            "spark.jars.packages",
            "org.apache.sedona:sedona-spark-3.5_2.12:1.7.1,"
            "org.datasyslab:geotools-wrapper:1.7.1-28.5",
        )
        .config(
            "spark.jars.repositories",
            "https://artifacts.unidata.ucar.edu/repository/unidata-all",
        )
        .config("spark.executor.memory", "8g")
        .config("spark.driver.memory", "8g")
        .getOrCreate()
    )
sedona = SedonaContext.create(config)

############# Load the shapes
sedona_shapes = sedona.read.format("shapefile").option("charset", "UTF-8").load(shapefile_path)
sedona_shapes.show()
sedona_shapes.createOrReplaceTempView("sedona_shapes")


############# LOAD RASTER WEATHER TIF
sedona_binary_raster = sedona.read.format("binaryFile").load(raster_path)
sedona_binary_raster.createOrReplaceTempView("sedona_binary_raster")

sedona_tif_raster = sedona_binary_raster.withColumn("raster", F.expr("RS_FromGeoTiff(content)"))
sedona_tif_raster.createOrReplaceTempView("sedona_tif_raster")

# Use RS_ZonalStats to count the number of pixels in the geometry
# Yukon-Koyukuk : 4681 px
# Alexandria : 5 px
# Very far from rasterstats!!
df_zonal_stats = sedona.sql("""
    SELECT
        shapes.name,
        RS_ZonalStats(sedona_tif_raster.raster, shapes.geometry, 1, 'count', true) AS count
    FROM sedona_tif_raster, sedona_shapes AS shapes
	WHERE shapes.name in ("Yukon-Koyukuk", "Alexandria")
    """).show()
	
# Using RS_PixelAsPolygons + ST_Intersects
# Yukon-Koyukuk : 2856 px
# Alexandria : 2 px
# These are the closest results to rasterstats 
intersect = sedona.sql("""
WITH exploded as (SELECT
  explode(RS_PixelAsPolygons(sedona_tif_raster.raster, 1)) 
FROM sedona_tif_raster)

SELECT col.geom, col.value, shapes.name, col.x, col.y FROM exploded, sedona_shapes as shapes
WHERE ST_Intersects(col.geom, shapes.geometry) AND shapes.name in ("Yukon-Koyukuk", "Alexandria")
""")
intersect.groupby('name').count().show()


# Using RS_AsRaster on Alaska (+filtering on non null values)
# Yukon-Koyukuk : 1931 px
# Alexandria : 2 px
# The envelope for Yukon-Koyukuk seems correct BUT there are pixels inside that have their value set to 0, 
# creating "holes" after value filtering - see map below
rasterized = sedona.sql("""
WITH rasterized AS (SELECT 
explode(RS_PixelAsPolygons(RS_AsRaster(shapes.geometry, sedona_tif_raster.raster, "D", true),1))
FROM sedona_shapes as shapes, sedona_tif_raster
WHERE shapes.name in ("Yukon-Koyukuk", "Alexandria")
)

SELECT col.geom, col.value, col.x, col.y
FROM rasterized
WHERE col.value != 0
""")
rasterized.count()


# Map visualisation of my experimentations
m = SedonaKepler.create_map()
SedonaKepler.add_df(m, sedona.sql('select * from sedona_shapes where name in ("Yukon-Koyukuk", "Alexandria")'), name='shapes')
SedonaKepler.add_df(m, intersect, name='intersect')
SedonaKepler.add_df(m, rasterized, name='rasterized')
m

ticrz avatar May 18 '25 20:05 ticrz

Thank you for your interest in Apache Sedona! We appreciate you opening your first issue. Contributions like yours help make Apache Sedona better.

github-actions[bot] avatar May 18 '25 20:05 github-actions[bot]

Hi team, Did you get some to look at my question ? Please do let me know if I should provide more details or if you have any hints for me. Thanks!

ticrz avatar Jun 24 '25 14:06 ticrz

@timcrouzet thanks. We are looking into this!

jiayuasu avatar Jun 24 '25 17:06 jiayuasu

Hi @timcrouzet, thanks for raising the issue. The bug should be fixed now.

For your reference, after the fix -

  • Using rasterstats.zonal_stats -

    • 2px for Alexandria
    • 2810px for Yukon-Koyukuk
  • Using sedona.rs_zonalstats -

    • 2px for Alexandria
    • 2842px for Yukon-Koyukuk

prantogg avatar Jul 03 '25 16:07 prantogg

@prantogg sorry, I didn't see the bug fix PR on Sedona repo. Can you double check?

Thanks!

jiayuasu avatar Jul 04 '25 04:07 jiayuasu

Thank you very much guys! I will test this when this gets released.

ticrz avatar Jul 07 '25 05:07 ticrz