sf icon indicating copy to clipboard operation
sf copied to clipboard

`st_sample` for very small, scattered multipolygons

Open edzer opened this issue 4 years ago • 2 comments

This issue was brought up to me by email. The naive algorithm used in st_sample to randomly sample n observatoins from a (multi-)polygon was to

  • compute the area of the polygon A
  • compute the area of its bounding box B
  • sample n * B / A points from the bounding box
  • select those within the polygon
  • if exact = TRUE continue sampling until we have n

This algorithm fails if B/A is very large, e.g. when sampling a very rare land use class over a large region (out of memory errors). The solution I worked on is this:

  • compute the areas of sub-polygons
  • distribute n proportional to these sub-areas
  • sample each sub-polygon

This worked somewhat, but wouldn't converge because many n values would be rounded to 0. The current implementation (which is not satisfactory) will, for polygons with rounded n of zero, use rbinom to draw a sample with probability n. See:

https://github.com/r-spatial/sf/commit/bbb7337eced301b2250b5f452d9f284aed44d80c https://github.com/r-spatial/sf/commit/4a910daec46a02d471aeb45b89002c0471f0c49b https://github.com/r-spatial/sf/commit/bb813992d6e8f36c84dd17afc73c932ea3e35827

I think the better solution would be to

  • compute the areas of the sub-polygons
  • divide them by the total
  • use that as a cutoff values for a uniform distributions
  • draw n U[0,1] values
  • classify them according to this distribution, and then draw the target n' values from each subpolygon.

This still needs to be done.

edzer avatar Sep 03 '20 14:09 edzer

I think I might have come up with a workaround on this. I had a set of 520 occurrence records, and needed to resample each of them inside a buffer around the original point, to see if coordinate precision will influence my model or not.

For that, I got the original dataset as a dataframe (directly imported from a csv and cleaned with cleancoordinates() )

> head(occd.cl)
  species       lat       lon uncert
1   alces  0.901015 -52.00409   2895
2   alces  3.802764 -61.74121   4181
3   alces  3.789833 -54.31493  65117
4   alces  5.664470 -55.17876  21503
5 armiger -0.269760 -79.10875   9725
6 batesii -9.522227 -66.92051  24918

turned it into a spatial points data frame and assigned it WGS84 projection:

coordinates(occd.cl) <- ~lon + lat
proj4string(occd.cl) <- projection(raster())

generated the buffer around the points and turned them into polygons using the sp package:

radius <- 0.5 #1=100km
occ.buffer <- gBuffer(spgeom = occd.cl, byid = T, 
                      width = radius, quadsegs = 100, 
                      capStyle = 'ROUND' , joinStyle = 'ROUND')

and then converted into sfc: occb.sf <- as(occ.buffer, "sfc")

Finally, I created a copy of the original points (class=SpatialPointDataFrame) to receive the data from st_sample, and used for to generate a single point for each polygon:

occ.rd.dt.1 <- as(occd.cl,"sfc")  

for(i in 1:length(occ.rd.dt.1)){
 occ.rd.dt.1[i] <- st_sample(x =occb.sf[i,], size = 1, type = "random", exact =T, by_polygon = T)
}

To my understanding, this has generated a random point inside the polygons generated around the original points. This is what I needed to do, but I think the iteration at the end of the code can work with the description of this topic. I'm sorry about all the detail, but I think it's always necessary.

I'll completely understand if I missed the point and this has nothing to do with it, and this comment has to be removed.

jfberner avatar Jun 15 '21 20:06 jfberner

Thanks! I think your sample is a stratified random sample, and your algorithm works because your polygons (buffers) are of equal size. The issue I tried to address is to randomly sample over a set of very small polygons (or a multi-polygon with sub-polygons) of unequal size.

edzer avatar Jun 15 '21 21:06 edzer