Rasters.jl
Rasters.jl copied to clipboard
Comparing Julia's Rasters.zonal vs Python's Geopandas + Rasterio
Hi, I am doing a project on nighttime lights. I am moving my code from Python to Julia. I had performed zonal statistics in a specific case in both Julia and Python. I had the code handy, so I felt it's a good opportunity to compare the results.
using Rasters
using Shapefile
using DataFrames
using PyCall
py"""
import rasterio # pip install rasterio
import numpy as np
from rasterio.mask import mask
import geopandas as gpd
def pyzonal(shapefilepath, rasterpath):
shp = gpd.read_file(shapefilepath)
raster = rasterio.open(rasterpath)
def geom_mask(geom, dataset=raster, **mask_kw):
mask_kw.setdefault('crop', True)
mask_kw.setdefault('all_touched', False)
mask_kw.setdefault('filled', False)
masked, mask_transform = mask(dataset=dataset, shapes=(geom,),
**mask_kw)
return masked
return shp.geometry.apply(geom_mask).apply(np.ma.sum)
"""
python_result = convert(Array{Int64},py"""pyzonal("INDIA/Admin2.shp", "DATA/Harmonized_DN_NTL_2008_calDMSP.tif")""")
function jlzonal(shapefilepath, rasterpath)
shp = Shapefile.Table(shapefilepath)
raster = Raster(rasterpath)
Int.(zonal(sum, raster; of=shp, boundary=:inside))
end
julia_result = jlzonal("INDIA/Admin2.shp", "DATA/Harmonized_DN_NTL_2008_calDMSP.tif")
DataFrame(julia_result = julia_result, python_result = python_result) |> print
36×2 DataFrame
Row │ julia_result python_result
│ Int64 Int64
─────┼─────────────────────────────
1 │ 9599 9946
2 │ 235197 238343
3 │ 6338 9135
4 │ 1298732 1316207
5 │ 14545 14683
6 │ 28220 29342
7 │ 7542 7652
8 │ 10986 12186
9 │ 986312 1005796
10 │ 1531090 1547802
11 │ 9346 9723
12 │ 37824 40080
13 │ 197552 202026
14 │ 983131 991010
15 │ 191717 194372
16 │ 332144 344015
17 │ 844276 851621
18 │ 4291 5973
19 │ 1169259 1187453
20 │ 0 108
21 │ 353851 357619
22 │ 10876 15085
23 │ 161010 165474
24 │ 416945 418016
25 │ 95337 102516
26 │ 43565 47522
27 │ 739229 762352
28 │ 162653 168294
29 │ 204255 208089
30 │ 1424513 1444589
31 │ 1119402 1134418
32 │ 615313 626291
33 │ 1055784 1073378
34 │ 9791 17393
35 │ 1973589 1992820
36 │ 7741 7741
I am not an expert on either of the two systems, just a user.
For ease of testing, please clone: https://github.com/xKDR/raster-experiments/ , where the data and the code has been pushed.
The code can also be found here: The state level shapefile for India can be found here: (originally from datameet) The raster file can be found here: (originally from LiEtal2020)
Amazing! Thanks for putting that together.
~~I think boundary=:touches
in Rasters.jl may be a better comparison here? I assume thats what all_touched
does in rasterio.~~
~~:inside
should have less pixels than :touches
as :inside
does not include any pixels the line touches. So this difference isn't unexpected.~~
Edit: I missed that all_touched
is false
.
But does that mean the comparison should be to boundary=:center
?
Edit 2: Yes all_touched, False
in rasterio means the center line is used, so boundary=:center
is the appropriate comparison here. I don't think it even has an option for boundary=:inside
.
https://rasterio.readthedocs.io/en/latest/api/rasterio.mask.html
We could compare boundary=:center
to all_touched, False
, and boundary=:touches
to all_touched, True
Here you go:
-
boundary=:touches
vsall_touched, True
py"""
import rasterio # pip install rasterio
import numpy as np
from rasterio.mask import mask
import geopandas as gpd
def pyzonal(shapefilepath, rasterpath):
shp = gpd.read_file(shapefilepath)
raster = rasterio.open(rasterpath)
def geom_mask(geom, dataset=raster, **mask_kw):
mask_kw.setdefault('crop', True)
mask_kw.setdefault('all_touched', True)
mask_kw.setdefault('filled', False)
masked, mask_transform = mask(dataset=dataset, shapes=(geom,),
**mask_kw)
return masked
return shp.geometry.apply(geom_mask).apply(np.ma.sum)
"""
python_result = convert(Array{Int64},py"""pyzonal("INDIA/Admin2.shp", "DATA/Harmonized_DN_NTL_2008_calDMSP.tif")""")
function jlzonal(shapefilepath, rasterpath)
shp = Shapefile.Table(shapefilepath)
raster = Raster(rasterpath)
Int.(zonal(sum, raster; of=shp, boundary=:touches))
end
julia_result = jlzonal("INDIA/Admin2.shp", "DATA/Harmonized_DN_NTL_2008_calDMSP.tif")
DataFrame(julia_result = julia_result, python_result = python_result) |> print
36×2 DataFrame
Row │ julia_result python_result
│ Int64 Int64
─────┼─────────────────────────────
1 │ 10292 10271
2 │ 241542 241023
3 │ 10590 11267
4 │ 1332685 1330178
5 │ 14848 14818
6 │ 30446 30311
7 │ 7761 7748
8 │ 13210 12996
9 │ 1024225 1021565
10 │ 1566123 1562609
11 │ 9961 9977
12 │ 42033 41918
13 │ 206436 205642
14 │ 998901 997366
15 │ 197394 196811
16 │ 351144 350714
17 │ 859799 858115
18 │ 7650 7483
19 │ 1202572 1200153
20 │ 363 319
21 │ 360502 360163
22 │ 18675 18325
23 │ 170134 169326
24 │ 419006 418850
25 │ 110406 109069
26 │ 50955 50640
27 │ 787630 782535
28 │ 173302 172874
29 │ 212077 211326
30 │ 1464562 1461464
31 │ 1150695 1147173
32 │ 636513 635311
33 │ 1090006 1087124
34 │ 25326 24501
35 │ 2011179 2008439
36 │ 7741 7741
using Rasters
using Shapefile
using DataFrames
using PyCall
py"""
import rasterio # pip install rasterio
import numpy as np
from rasterio.mask import mask
import geopandas as gpd
def pyzonal(shapefilepath, rasterpath):
shp = gpd.read_file(shapefilepath)
raster = rasterio.open(rasterpath)
def geom_mask(geom, dataset=raster, **mask_kw):
mask_kw.setdefault('crop', True)
mask_kw.setdefault('all_touched', False)
mask_kw.setdefault('filled', False)
masked, mask_transform = mask(dataset=dataset, shapes=(geom,),
**mask_kw)
return masked
return shp.geometry.apply(geom_mask).apply(np.ma.sum)
"""
python_result = convert(Array{Int64},py"""pyzonal("INDIA/Admin2.shp", "DATA/Harmonized_DN_NTL_2008_calDMSP.tif")""")
function jlzonal(shapefilepath, rasterpath)
shp = Shapefile.Table(shapefilepath)
raster = Raster(rasterpath)
Int.(zonal(sum, raster; of=shp, boundary=:center))
end
julia_result = jlzonal("INDIA/Admin2.shp", "DATA/Harmonized_DN_NTL_2008_calDMSP.tif")
DataFrame(julia_result = julia_result, python_result = python_result) |> print
36×2 DataFrame
Row │ julia_result python_result
│ Int64 Int64
─────┼─────────────────────────────
1 │ 9946 9946
2 │ 238339 238343
3 │ 8930 9135
4 │ 1316207 1316207
5 │ 14683 14683
6 │ 29306 29342
7 │ 7652 7652
8 │ 12183 12186
9 │ 1005791 1005796
10 │ 1547802 1547802
11 │ 9702 9723
12 │ 40080 40080
13 │ 202011 202026
14 │ 991006 991010
15 │ 194367 194372
16 │ 344010 344015
17 │ 851621 851621
18 │ 5968 5973
19 │ 1187453 1187453
20 │ 104 108
21 │ 357587 357619
22 │ 15057 15085
23 │ 165467 165474
24 │ 418016 418016
25 │ 102414 102516
26 │ 47517 47522
27 │ 762160 762352
28 │ 168294 168294
29 │ 208089 208089
30 │ 1444589 1444589
31 │ 1134418 1134418
32 │ 626291 626291
33 │ 1073373 1073378
34 │ 17393 17393
35 │ 1992748 1992820
36 │ 7741 7741
For 2, Julia and Python are indeed very close. @cmg777
Looking through rasterio it seems we are actually comparing with GDAL, which is a good benchmark.
The second, very close version with :center
is basically the output of PolygonInbounds.jl. I'm not sure what would account for the slight differences, it may be just the difference between a >
and a >=
somewhere in the algorithm - this seems unlikely but people tend to use round numbers so it actually gets hit a lot.
The first :touches
example with bigger difference uses my custom line drawing algorithm to calculate which pixels touch (on top of which are chosen by PolygonInbounds.jl). I think the alg in GDAL is here: https://github.com/OSGeo/gdal/blob/c7e3e652c32773ab5627aafa8c3bd76e01d657b6/alg/gdalrasterize.cpp
The GDAL alg gvBurnScanline
and its variants accept int
start and end values. And rasterio suggests it uses a raw Bresenham algorithm. So it's rounding the start and end point of the line, which is an approximation of which pixels are touched.
My alg uses the exact floating point start and end points from the shape file and steps along the line between those, rather than the line directly between two integer x/y positions. That could be the difference? although I am hesitant to say Rasters.jl is more accurate than GDAL, it is possible!! Seeing the work they have to do with so many variants of that code just to burn in different value types I can see why they might choose the simpler algorithm. It would also be faster than mine - but we get so much speed elsewhere we can afford that a little more.
(if this is all true, we could add boundary=:bresenham
to use to be consistent with GDAL)
Edit - there is also this algorithm which keeps things floating point:
https://github.com/OSGeo/gdal/blob/030ff40cf8340273bcc797e90c938cc32d14a34f/alg/llrasterize.cpp#L377
Its a bit convoluted but I think this may be what we are comparing to. It is very likely as good or better than my algorithm, but it would be interesting to see what the exact differnces are... probably making masks with both and comparing them is the best way to see whats happening.
Edit 2: It would also be good to plot the output boolmask
for each polgon in the shapefile... there is probably just a bug in the line algorithm. Looking at the code I am actually half way through fixing a potential bug I found when writing another line related algorithm but isn't hit by the tests. That would make more sense than anything else I have suggested...
Perhaps I can change the start and end points to Int just for testing. Can you tell me which line to change here? Or perhaps create a separate branch on Rasters.jl with that change? I can generate the zonal statistics using that branch.
I had plotted the boolmask
for a few polygons where the difference was relatively large. I opened the same plot in QGIS. Hard to tell if there was any difference.
I was mistaken, GDAL is actually using Floating point for that - also pretty hard to understand that algorithm!
I can't currently get a rasterio build working on my machine or via conda (another thing I dont miss about python!) but I might rewrite this to use ArchGDAL.jl/GDAL.jl and do the comparison there as it will be the same algorithm.
(we can probably just multiply both output rasters together to find the common pixels then mask them separately with the negative of that to see which pixels are not shared)
Do you know to do zonal statistics in ArchGDAL? Perhaps we can first compare ArchGDAL vs Rasterio.
Yes it should be doable in ArchGDAL, it will just needs a lot more code than Rasters.jl. gdalrasterize
to make a mask of the shapes (converted to GDAL polygons) then do the stats in raw julia on the resulting arrays. The result should be identical to rasterio as it's all just GDAL. I guess Rasters is a rare package to do this natively (mostly because I would like it to be 10x faster than rasterio)
Ok, it seems there is a bug in :touches
. Will have a look at fixing it next weekend.
Since there was a bug fix, I have computed the results again for boundary=:touches
vs all_touched, True
36×2 DataFrame
Row │ julia_result python_result
│ Int64 Int64
─────┼─────────────────────────────
1 │ 10271 10271
2 │ 241015 241023
3 │ 10443 11267
4 │ 1330156 1330178
5 │ 14818 14818
6 │ 30206 30311
7 │ 7744 7748
8 │ 12984 12996
9 │ 1021443 1021565
10 │ 1562556 1562609
11 │ 9922 9977
12 │ 41859 41918
13 │ 205614 205642
14 │ 997308 997366
15 │ 196792 196811
16 │ 350633 350714
17 │ 858087 858115
18 │ 7469 7483
19 │ 1200142 1200153
20 │ 307 319
21 │ 360106 360163
22 │ 18178 18325
23 │ 169262 169326
24 │ 418850 418850
25 │ 108623 109069
26 │ 50596 50640
27 │ 782223 782535
28 │ 172838 172874
29 │ 211326 211326
30 │ 1461293 1461464
31 │ 1147133 1147173
32 │ 635311 635311
33 │ 1087058 1087124
34 │ 24274 24501
35 │ 2008286 2008439
36 │ 7741 7741
Thanks! That looks a lot closer.
I will try tweaking it to see if can be the same, but Im not sure now if the difference is a real problem or just a floating point error or >/>= difference .
The answers are close, but Rasters' results are consistently lower than rasterio's.
Yes that's interesting. So the problem is occasionally not including a pixel that gdal includes.
Ok I've solved this.
The remaining problem is actually in crop
rather than rasterize
. That's why the tests against gdal rasterize didn't catch this.
The way DimensionalData.jl works selecting areas is not exactly what crop
should do when cells are Intervals
- it's dropping cells that are only half inside an Extent
. But for boundary=:touches
and probably most use cases we really want anything that touches the area at all. Using the modified version, the results of zonal
are almost identical. I'll PR soon and make a new minor version for DD as its a breaking change.
The remainder will be the differences on pixel corners in the middle of lines, as in the rasterize
tests:
julia> @time python_zonal = convert(Array{Int64}, py"""pyzonal($shapefilepath, $rasterpath)""");
3.100880 seconds (156 allocations: 4.016 KiB)
julia> @time julia_zonal = jlzonal(shapefilepath, rasterpath);
0.820940 seconds (14.85 k allocations: 175.825 MiB, 0.88% gc time)
julia> difference = python_zonal .- julia_zonal;
julia> DataFrame(; julia_zonal, python_zonal, difference)
36×3 DataFrame
│ Row │ julia_zonal │ python_zonal │ difference │
│ │ Int64 │ Int64 │ Int64 │
├─────┼─────────────┼──────────────┼────────────┤
│ 1 │ 10271 │ 10271 │ 0 │
│ 2 │ 241023 │ 241023 │ 0 │
│ 3 │ 11267 │ 11267 │ 0 │
│ 4 │ 1330178 │ 1330178 │ 0 │
│ 5 │ 14818 │ 14818 │ 0 │
│ 6 │ 30311 │ 30311 │ 0 │
│ 7 │ 7748 │ 7748 │ 0 │
│ 8 │ 12991 │ 12996 │ 5 │
│ 9 │ 1021556 │ 1021565 │ 9 │
│ 10 │ 1562617 │ 1562609 │ -8 │
│ 11 │ 9977 │ 9977 │ 0 │
│ 12 │ 41918 │ 41918 │ 0 │
│ 13 │ 205642 │ 205642 │ 0 │
│ 14 │ 997366 │ 997366 │ 0 │
│ 15 │ 196811 │ 196811 │ 0 │
│ 16 │ 350714 │ 350714 │ 0 │
│ 17 │ 858119 │ 858115 │ -4 │
│ 18 │ 7483 │ 7483 │ 0 │
│ 19 │ 1200148 │ 1200153 │ 5 │
│ 20 │ 319 │ 319 │ 0 │
│ 21 │ 360163 │ 360163 │ 0 │
│ 22 │ 18325 │ 18325 │ 0 │
│ 23 │ 169326 │ 169326 │ 0 │
│ 24 │ 418850 │ 418850 │ 0 │
│ 25 │ 109069 │ 109069 │ 0 │
│ 26 │ 50640 │ 50640 │ 0 │
│ 27 │ 782535 │ 782535 │ 0 │
│ 28 │ 172874 │ 172874 │ 0 │
│ 29 │ 211326 │ 211326 │ 0 │
│ 30 │ 1461464 │ 1461464 │ 0 │
│ 31 │ 1147173 │ 1147173 │ 0 │
│ 32 │ 635311 │ 635311 │ 0 │
│ 33 │ 1087124 │ 1087124 │ 0 │
│ 34 │ 24501 │ 24501 │ 0 │
│ 35 │ 2008439 │ 2008439 │ 0 │
│ 36 │ 7741 │ 7741 │ 0 │
Also good to see it's 4x faster than rasterio.
Brilliant!
Just noting this isn't actually merged yet so reopening until its fixed and registered
This should be fixed now.
I have not tested the GDAL touches approach, but would that make any difference in this comparison since the GDAL algorithm has been changed? See: https://github.com/OSGeo/gdal/issues/8622#issuecomment-1783306250 Thanks!