`st_sample` for very small, scattered multipolygons
This issue was brought up to me by email. The naive algorithm used in `st_sample`
to randomly sample n observations from a (multi-)polygon was to
- compute the area of the polygon A
- compute the area of its bounding box B
- sample n * B / A points from the bounding box
- select those within the polygon
- if `exact = TRUE`, continue sampling until we have n points
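The rejection algorithm above can be sketched in base R for a toy case: the triangle (0,0)-(1,0)-(0,1) inside the unit-square bounding box, so A = 0.5 and B = 1. This is illustrative code only, not `st_sample`'s actual implementation, and `naive_sample` is a hypothetical helper:

```r
# Naive rejection sampler for a triangle (a sketch, not sf's code).
naive_sample <- function(n) {
  A <- 0.5                          # polygon area
  B <- 1                            # bounding-box area
  m <- ceiling(n * B / A)           # draws needed so that ~n land inside
  x <- runif(m, 0, 1)               # uniform points in the bounding box
  y <- runif(m, 0, 1)
  keep <- x + y <= 1                # point-in-polygon test for this triangle
  cbind(x = x, y = y)[keep, , drop = FALSE]
}

set.seed(1)
pts <- naive_sample(10)
# when B/A is very large, m = n * B / A explodes -- hence the
# out-of-memory errors described below
```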
This algorithm fails if B/A is very large, e.g. when sampling a very rare land use class over a large region (leading to out-of-memory errors). The solution I worked on is this:
- compute the areas of sub-polygons
- distribute n proportional to these sub-areas
- sample each sub-polygon
This worked somewhat, but wouldn't converge because many n values would be rounded to 0. The current implementation (which is not satisfactory) will, for sub-polygons with a rounded n of zero, use `rbinom` to draw a sample with success probability equal to the fractional n. See:
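The `rbinom` fallback can be sketched as follows; the areas are made up for illustration, and this is the idea of the current implementation rather than its exact code:

```r
# Allocate n points over sub-polygons proportionally to area; sub-polygons
# whose share rounds to zero get a Bernoulli draw instead (a sketch).
areas <- c(10, 0.4, 0.3, 89.3)             # hypothetical sub-polygon areas
n <- 10
expected <- n * areas / sum(areas)         # fractional count per sub-polygon
alloc <- round(expected)                   # here: c(1, 0, 0, 9)

set.seed(42)
zero <- alloc == 0
alloc[zero] <- rbinom(sum(zero), size = 1, prob = expected[zero])
# note: sum(alloc) is not guaranteed to equal n, which is why this
# implementation is unsatisfactory
```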
https://github.com/r-spatial/sf/commit/bbb7337eced301b2250b5f452d9f284aed44d80c
https://github.com/r-spatial/sf/commit/4a910daec46a02d471aeb45b89002c0471f0c49b
https://github.com/r-spatial/sf/commit/bb813992d6e8f36c84dd17afc73c932ea3e35827
I think the better solution would be to
- compute the areas of the sub-polygons
- divide them by the total
- use these as cutoff values for a uniform distribution
- draw n U[0,1] values
- classify them according to this distribution, and then draw the target n' values from each sub-polygon.
This still needs to be done.
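The proposed allocation could be sketched like this, using `findInterval()` to classify the uniform draws; this is hypothetical code (not yet implemented in sf) and the areas are made up:

```r
# Cumulative area shares as cutoffs for U[0,1] draws (a sketch).
areas <- c(10, 0.4, 0.3, 89.3)             # hypothetical sub-polygon areas
n <- 10
cutoffs <- cumsum(areas) / sum(areas)      # cumulative shares, ending at 1

set.seed(1)
u <- runif(n)                              # n draws from U[0,1]
idx <- findInterval(u, cutoffs) + 1        # sub-polygon hit by each draw
alloc <- tabulate(idx, nbins = length(areas))
# sum(alloc) == n by construction, so the allocation always converges;
# each sub-polygon i then receives alloc[i] points by rejection sampling
```

Statistically this is the same as drawing all per-polygon counts in one step with `rmultinom(1, n, areas / sum(areas))`.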
I think I might have come up with a workaround for this. I had a set of 520 occurrence records, and needed to resample each of them inside a buffer around the original point, to see whether coordinate precision would influence my model or not.
For that, I loaded the original dataset as a data frame (imported directly from a CSV and cleaned with cleancoordinates()):
```
> head(occd.cl)
  species       lat       lon uncert
1   alces  0.901015 -52.00409   2895
2   alces  3.802764 -61.74121   4181
3   alces  3.789833 -54.31493  65117
4   alces  5.664470 -55.17876  21503
5 armiger -0.269760 -79.10875   9725
6 batesii -9.522227 -66.92051  24918
```
turned it into a SpatialPointsDataFrame and assigned it the WGS84 CRS:
```r
coordinates(occd.cl) <- ~lon + lat
proj4string(occd.cl) <- projection(raster())
```
generated the buffers around the points and turned them into polygons using gBuffer() from the rgeos package:
```r
radius <- 0.5  # 1 degree ~ 100 km
occ.buffer <- gBuffer(spgeom = occd.cl, byid = TRUE,
                      width = radius, quadsegs = 100,
                      capStyle = 'ROUND', joinStyle = 'ROUND')
```
and then converted into sfc:
```r
occb.sf <- as(occ.buffer, "sfc")
```
Finally, I created a copy of the original points (class SpatialPointsDataFrame) to receive the output of st_sample, and used a for loop to generate a single point for each polygon:
```r
occ.rd.dt.1 <- as(occd.cl, "sfc")
for (i in 1:length(occ.rd.dt.1)) {
  occ.rd.dt.1[i] <- st_sample(x = occb.sf[i, ], size = 1, type = "random",
                              exact = TRUE, by_polygon = TRUE)
}
```
To my understanding, this generated a random point inside each of the polygons built around the original points. It is what I needed to do, and I think the iteration at the end of the code may be relevant to this topic. Sorry for all the detail, but I think it's always helpful.
I'll completely understand if I missed the point, this has nothing to do with the issue, and this comment has to be removed.
Thanks! I think your sample is a stratified random sample, and your algorithm works because your polygons (buffers) are of equal size. The issue I tried to address is to randomly sample over a set of very small polygons (or a multi-polygon with sub-polygons) of unequal size.