Rasters.jl icon indicating copy to clipboard operation
Rasters.jl copied to clipboard

Add random sampling functionality (like R's `terra::spatSample()`)

Open edwardlavender opened this issue 1 year ago • 5 comments

Thanks for a great package! This is a feature request for a function like R’s terra::spatSample() that enables random sampling of locations on a raster, optionally excluding NaN/missing values. Below is some quick-and-dirty example code which demonstrates the functionality, but is inefficient when NaNs should be ignored. Maybe this is already quick & easy to do in Julia? (I am coming from R.)

# Convert Raster to DataFrame, optionally dropping NaNs 
function asDataFrame(; x::Rasters.Raster, na_rm::Bool = true)
    # Get DataFrame
    xyz = DataFrame(x)
    rename!(xyz, ["x", "y", "z"])
    # Filter non NA cells 
    if na_rm
        xyz = xyz[.!isnan.(xyz.z), :]
    end
    return xyz
end 

# terra::spatSample() replacement, via asDataFrame()
function spatSample(; x::Rasters.Raster, size::Int64 = 1, na_rm::Bool = true)
    xyz = asDataFrame(x = x, na_rm = na_rm)
    xyz[sample(1:nrow(xyz), size, replace = true), :]
end 

edwardlavender avatar Sep 30 '24 10:09 edwardlavender

No need for anything that complicated ;)

StatsBase gives us the sample method that will work on any AbstractArray. But we need to remove missing values from it first. To do that on pretty much any kind of iterator in Julia we just skipmissing(obj). Then collect turns it into an array again as sampling over a lazy iterator like skipmissing returns cant work efficiently, and sample needs some AbstractArray input anyway. But this is it, here giving you 100 samples:

using StatsBase
sample(collect(skipmissing(raster)), 100)

To keep the missing values, its just

using StatsBase
sample(raster, 100)

But you might want to use replace_missing(rast) first to make sure the missingval is something sensible like missing that the rest of Julia will understand (this will be the default behaviour very soon).

The key message here is unlike in R, a Raster is just an AbstractArray. So most packages that accept AbstractArray will just work with a Rasters.Raster, there is no need for intentional integration. (except in edge cases like larger than memory lazy disk arrays, etc).

rafaqz avatar Sep 30 '24 11:09 rafaqz

Thanks a lot for the quick response & detailed explanation! That is really helpful. That is a very nice solution for sampling values--but I was really hoping to sample both coordinates and values for non-NA cells, hence the use of DataFrame() above. Is there a simple way to do this that avoids converting the entire raster into a DataFrame? Thanks again!

edwardlavender avatar Sep 30 '24 13:09 edwardlavender

It's probably no problem to convert, it's just DataFrame(raster), just use replace_missing first for now.

DataFrame(replace_missing(raster)).

Then probably sample will work too. You may need to use dropmissings on the DataFrame, look at the ?dropmissings help for syntax.

rafaqz avatar Sep 30 '24 13:09 rafaqz

Fantastic - thank you!

edwardlavender avatar Sep 30 '24 13:09 edwardlavender

I've been thinking about adding some sample function that mirrors syntax and outputs generated by extract - which includes a keywords to return coordinates and to skip missing values. So good to know that more people want this and let's keep this issue open :)

tiemvanderdeure avatar Oct 01 '24 12:10 tiemvanderdeure

This is now possible by calling Rasters.sample(x, n; skipmissing = true)!

tiemvanderdeure avatar Oct 16 '24 12:10 tiemvanderdeure