Reconciling rasterstats.zonal_stats with RS_ZonalStats
Hi team,
First of all, thank you for the great work on Sedona!
I am currently attempting to replace some legacy rasterstats scripts with a Sedona based SQL approach. However, I'm facing challenges reconciling the outputs from Sedona's RS_ZonalStats function with those from the original Python rasterstats.zonal_stats.
My goal is to perform zonal statistics on a raster dataset, aggregating values based on polygon geometries. To debug the discrepancy, I've experimented specifically with pixel counts and found the following:
Using RS_PixelAsPolygons + ST_Intersects, I am able to almost match (depending on the shapes) results from Python's rasterstats.zonal_stats.
However, using RS_ZonalStats, RS_Clip, or RS_AsRaster, the results do not match, and the reason for the differences isn't clear to me.
You will find below 2 Jupyter notebooks demonstrating my experimentations : Loading shapes (two US counties) Loading a raster (.tif file with surface temperature data) Comparison of pixel count results from different methods
Could someone please help clarify why these discrepancies occur or suggest best practices for using RS_ZonalStats in a way consistent with the traditional rasterstats results?
Thanks very much in advance for your help!
Using rasterstats
! pip install geopandas==1.0.1 rasterstats==0.20.0
! wget https://www2.census.gov/geo/tiger/GENZ2023/shp/cb_2023_us_county_20m.zip
! unzip cb_2023_us_county_20m.zip
! wget https://geodata.ucdavis.edu/climate/worldclim/2_1/hist/cts4.06/10m/wc2.1_cruts4.06_10m_tmin_2020-2021.zip
! unzip wc2.1_cruts4.06_10m_tmin_2020-2021.zip
import geopandas as gpd
from rasterstats import zonal_stats
shapefile_path = "cb_2023_us_county_20m.shp"
raster_path = "wc2.1_10m_tmin_2021-12.tif"
# Load the shapefile into a GeoDataFrame
gdf = gpd.read_file(shapefile_path)
# Perform zonal statistics to count raster pixels within each geometry
# all_touched=True includes all pixels touched by geometries
stats = zonal_stats(
vectors=gdf.geometry,
raster=raster_path,
stats=["count"],
all_touched=True,
raster_out=False
)
# Extract pixel counts and add as a new column
gdf["pixel_count"] = [s["count"] for s in stats]
# 2px for Alexandria and 2810px for Yukon-Koyukuk
gdf[(gdf["NAME"]=="Yukon-Koyukuk") | (gdf["NAME"]=="Alexandria")]
Using sedona
! pip install apache-sedona[spark]==1.7.1
from sedona.spark import *
import pyspark.sql.functions as F
config = (
SedonaContext.builder()
.config(
"spark.jars.packages",
"org.apache.sedona:sedona-spark-3.5_2.12:1.7.1,"
"org.datasyslab:geotools-wrapper:1.7.1-28.5",
)
.config(
"spark.jars.repositories",
"https://artifacts.unidata.ucar.edu/repository/unidata-all",
)
.config("spark.executor.memory", "8g")
.config("spark.driver.memory", "8g")
.getOrCreate()
)
sedona = SedonaContext.create(config)
############# Load the shapes
sedona_shapes = sedona.read.format("shapefile").option("charset", "UTF-8").load(shapefile_path)
sedona_shapes.show()
sedona_shapes.createOrReplaceTempView("sedona_shapes")
############# LOAD RASTER WEATHER TIF
sedona_binary_raster = sedona.read.format("binaryFile").load(raster_path)
sedona_binary_raster.createOrReplaceTempView("sedona_binary_raster")
sedona_tif_raster = sedona_binary_raster.withColumn("raster", F.expr("RS_FromGeoTiff(content)"))
sedona_tif_raster.createOrReplaceTempView("sedona_tif_raster")
# Use RS_ZonalStats to count the number of pixels in the geometry
# Yukon-Koyukuk : 4681 px
# Alexandria : 5 px
# Very far from rasterstats!!
df_zonal_stats = sedona.sql("""
SELECT
shapes.name,
RS_ZonalStats(sedona_tif_raster.raster, shapes.geometry, 1, 'count', true) AS count
FROM sedona_tif_raster, sedona_shapes AS shapes
WHERE shapes.name in ("Yukon-Koyukuk", "Alexandria")
""").show()
# Using RS_PixelAsPolygons + ST_Intersects
# Yukon-Koyukuk : 2856 px
# Alexandria : 2 px
# These are the closest results to rasterstats
intersect = sedona.sql("""
WITH exploded as (SELECT
explode(RS_PixelAsPolygons(sedona_tif_raster.raster, 1))
FROM sedona_tif_raster)
SELECT col.geom, col.value, shapes.name, col.x, col.y FROM exploded, sedona_shapes as shapes
WHERE ST_Intersects(col.geom, shapes.geometry) AND shapes.name in ("Yukon-Koyukuk", "Alexandria")
""")
intersect.groupby('name').count().show()
# Using RS_AsRaster on Alaska (+filtering on non null values)
# Yukon-Koyukuk : 1931 px
# Alexandria : 2 px
# The envelope for Yukon-Koyukuk seems correct BUT there are pixels inside that have their value set to 0,
# creating "holes" after value filtering - see map below
rasterized = sedona.sql("""
WITH rasterized AS (SELECT
explode(RS_PixelAsPolygons(RS_AsRaster(shapes.geometry, sedona_tif_raster.raster, "D", true),1))
FROM sedona_shapes as shapes, sedona_tif_raster
WHERE shapes.name in ("Yukon-Koyukuk", "Alexandria")
)
SELECT col.geom, col.value, col.x, col.y
FROM rasterized
WHERE col.value != 0
""")
rasterized.count()
# Map visualisation of my experimentations
m = SedonaKepler.create_map()
SedonaKepler.add_df(m, sedona.sql('select * from sedona_shapes where name in ("Yukon-Koyukuk", "Alexandria")'), name='shapes')
SedonaKepler.add_df(m, intersect, name='intersect')
SedonaKepler.add_df(m, rasterized, name='rasterized')
m
Thank you for your interest in Apache Sedona! We appreciate you opening your first issue. Contributions like yours help make Apache Sedona better.
Hi team, Did you get some to look at my question ? Please do let me know if I should provide more details or if you have any hints for me. Thanks!
@timcrouzet thanks. We are looking into this!
Hi @timcrouzet, thanks for raising the issue. The bug should be fixed now.
For your reference, after the fix -
-
Using
rasterstats.zonal_stats-2pxfor Alexandria2810pxfor Yukon-Koyukuk
-
Using
sedona.rs_zonalstats-2pxfor Alexandria2842pxfor Yukon-Koyukuk
@prantogg sorry, I didn't see the bug fix PR on Sedona repo. Can you double check?
Thanks!
Thank you very much guys! I will test this when this gets released.