Rasters.jl
Rasters.jl copied to clipboard
Add random sampling functionality (like R's `terra::spatSample()`)
Thanks for a great package! This is a feature request for a function like R’s terra::spatSample() that enables random sampling of locations on a raster, optionally excluding NaN/missing values. Below is some quick-and-dirty example code which demonstrates the functionality, but is inefficient when NaNs should be ignored. Maybe this is already quick & easy to do in Julia? (I am coming from R.)
# Convert Raster to DataFrame, optionally dropping NaNs
function asDataFrame(; x::Rasters.Raster, na_rm::Bool = true)
# Get DataFrame
xyz = DataFrame(x)
rename!(xyz, ["x", "y", "z"])
# Filter non NA cells
if na_rm
xyz = xyz[.!isnan.(xyz.z), :]
end
return xyz
end
# terra::spatSample() replacement, via asDataFrame()
function spatSample(; x::Rasters.Raster, size::Int64 = 1, na_rm::Bool = true)
xyz = asDataFrame(x = x, na_rm = na_rm)
xyz[sample(1:nrow(xyz), size, replace = true), :]
end
No need for anything that complicated ;)
StatsBase gives us the sample method that will work on any AbstractArray. But we need to remove missing values from it first. To do that on pretty much any kind of iterator in Julia we just skipmissing(obj). Then collect turns it into an array again as sampling over a lazy iterator like skipmissing returns cant work efficiently, and sample needs some AbstractArray input anyway. But this is it, here giving you 100 samples:
using StatsBase
sample(collect(skipmissing(raster)), 100)
To keep the missing values, its just
using StatsBase
sample(raster, 100)
But you might want to use replace_missing(rast) first to make sure the missingval is something sensible like missing that the rest of Julia will understand (this will be the default behaviour very soon).
The key message here is unlike in R, a Raster is just an AbstractArray. So most packages that accept AbstractArray will just work with a Rasters.Raster, there is no need for intentional integration. (except in edge cases like larger than memory lazy disk arrays, etc).
Thanks a lot for the quick response & detailed explanation! That is really helpful. That is a very nice solution for sampling values--but I was really hoping to sample both coordinates and values for non-NA cells, hence the use of DataFrame() above. Is there a simple way to do this that avoids converting the entire raster into a DataFrame? Thanks again!
It's probably no problem to convert, it's just DataFrame(raster), just use replace_missing first for now.
DataFrame(replace_missing(raster)).
Then probably sample will work too. You may need to use dropmissings on the DataFrame, look at the ?dropmissings help for syntax.
Fantastic - thank you!
I've been thinking about adding some sample function that mirrors syntax and outputs generated by extract - which includes a keywords to return coordinates and to skip missing values. So good to know that more people want this and let's keep this issue open :)
This is now possible by calling Rasters.sample(x, n; skipmissing = true)!