
Support for spatial data frames - {sf} package format

Open jlacko opened this issue 3 years ago • 4 comments

The {multidplyr} package changes the class of the object distributed to workers to multidplyr_party_df. This causes the loss of the "special sauce" that the {sf} package provides for spatial datasets (special interpretation of the geometry column, and information about the coordinate reference system).

It would be advantageous for spatial data processing to allow parallelization of some tasks, such as the point-in-polygon operation demonstrated in the reprex below.

Doing so would likely require keeping the class of the distributed object unchanged (or perhaps re-implementing the sf methods, in which case the issue would likely fall outside the scope of the {multidplyr} package).

library(sf)
library(dplyr)   

# Well known & much loved shapefile of NC included with sf package
nc_polygons <- sf::st_read(system.file("shape/nc.shp", 
                                       package = "sf"),
                           quiet = TRUE)

# NC state lines
nc_state <- nc_polygons %>% 
  summarise()

# some random points over NC state
nc_points <- sf::st_sample(nc_state, 500) %>% 
  st_as_sf()

# this will work - single core
result <- nc_polygons %>% 
  st_join(nc_points) %>% 
  group_by(NAME) %>% 
  count() %>% 
  st_drop_geometry()

# this will break
library(multidplyr)

cluster <- new_cluster(4)
cluster_library(cluster, packages = c("dplyr", "sf"))
cluster_copy(cluster, "nc_points")

result <- nc_polygons %>% 
  partition(cluster) %>%
  st_join(nc_points) %>%  # the pipe breaks here, even though nc_points is available
  group_by(NAME) %>% 
  count() %>% 
  st_drop_geometry() %>% 
  collect()
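
Until something like this works in multidplyr, one possible stopgap for the reprex is to split the polygons by hand and run the join on workers from base R's parallel package. This is only a sketch and not a multidplyr feature: it swaps in `parallel::parLapply()`, and `n_workers`, `chunks`, and `counts` are names made up for illustration. It also uses `left = FALSE`, so counties without any points are dropped instead of showing up as one row each, which differs from the single-core version above.

```r
library(sf)
library(dplyr)
library(parallel)

nc_polygons <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc_points <- st_sample(st_union(nc_polygons), 500) %>%
  st_as_sf()

# hypothetical worker count; adjust to the machine
n_workers <- 2
cl <- makeCluster(n_workers)
clusterEvalQ(cl, {
  library(sf)
  library(dplyr)
})
clusterExport(cl, "nc_points")

# split() keeps each chunk an sf object, so st_join() still dispatches on workers
chunks <- split(nc_polygons, rep_len(seq_len(n_workers), nrow(nc_polygons)))

counts <- parLapply(cl, chunks, function(chunk) {
  chunk %>%
    st_join(nc_points, left = FALSE) %>%  # inner join: only counties with points
    st_drop_geometry() %>%
    count(NAME)
})
stopCluster(cl)

# each county lives in exactly one chunk, so the pieces can simply be stacked
result <- bind_rows(counts)
```

For 100 polygons and 500 points the overhead of starting workers probably outweighs any gain; the pattern is more interesting for much larger inputs.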



jlacko, Feb 25 '21 17:02

Hello,

I have the same issue with the fable package. I am able to fit the model because the data is a tsibble, which inherits from tibble. But when I go to forecast, the data frame is a mable (a model table, i.e. a table of models), and it gets converted to a plain tibble, so the fable package's forecast function doesn't know what to do with it.

It would be nice if there was some way to inherit the class of the object that is being passed to the cluster.

This is a great package by the way. The future package is much more complicated than this, touchy, and inconsistent when trying to take advantage of nearly all cores. Please do not let this project slide.

Thanks!!

Fredo-XVII, Jul 08 '21 04:07

Definitely second the {sf} package inheritance request. I may be wrong, but multidplyr is an incredible opportunity to make massive computations more efficient.

astraetech, Oct 29 '21 23:10

I also would like to see support for sf (or in general other "specialised" tibble classes).

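As an aside, a workaround for using sf in a parallel pipe is to call the sf functions directly on the geometry column inside mutate() and restore the sf class after collect():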

  grid_sf3 <- grid_sf2 %>%
    multidplyr::partition(cluster) %>%
    dplyr::mutate(
      dist = as.numeric(sf::st_distance(geometry, coast))
    ) %>%
    dplyr::collect() %>%
    sf::st_sf()
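
The final `st_sf()` call presumably works because the geometry column itself stays an sfc list-column even when the surrounding table is no longer an sf object. The sketch below illustrates that idea without multidplyr: `as.data.frame()` is only a stand-in for the class loss (an assumption, not what multidplyr does internally), and `plain` and `restored` are illustrative names.

```r
library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Stand-in for the class loss: the table stops being an sf object
plain <- as.data.frame(nc)
inherits(plain, "sf")            # is the table still sf?
inherits(plain$geometry, "sfc")  # is the geometry column still sfc?

# st_sf() rebuilds the sf object from the surviving sfc column
restored <- st_sf(plain)
inherits(restored, "sf")
st_crs(restored) == st_crs(nc)   # the CRS travels with the sfc column
```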

avsdev-cw, Nov 10 '21 12:11

Hello! Does anyone have a better approach? Unfortunately, the only thing I can think of is to process the non-geometry parts of the "data.frame" with multidplyr, do the sf operations separately, and then join the two tables. I also tried a few things with furrr, but it feels like it takes endless, unreadable lines of code to do something relatively simple. The worst case would be putting together a Spark cluster just for a summarize().
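
Roughly, the split-and-join idea could look like the sketch below. It assumes NAME is a unique key in the NC shapefile; the columns computed here (`sid_rate`, `area_m2`) and the object names (`attrs`, `rates`, `areas`) are placeholders chosen for illustration, not anything from a real workflow.

```r
library(sf)
library(dplyr)
library(multidplyr)

nc_polygons <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

cluster <- new_cluster(2)
cluster_library(cluster, "dplyr")

# Attribute table only: no geometry, so nothing sf-specific goes to the workers
attrs <- nc_polygons %>%
  st_drop_geometry() %>%
  select(NAME, BIR74, SID74)

# Non-spatial work runs in parallel with multidplyr
rates <- attrs %>%
  partition(cluster) %>%
  mutate(sid_rate = SID74 / BIR74) %>%
  collect()

# Spatial work stays in the main session, on a proper sf object
areas <- nc_polygons %>%
  select(NAME) %>%
  mutate(area_m2 = as.numeric(st_area(geometry)))

# Join the two halves back together; the sf object on the left keeps its class
result <- areas %>%
  left_join(rates, by = "NAME")
```

This only pays off when the expensive part is the non-spatial computation, since every sf operation still runs on a single core.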

jfulponi, Jul 23 '23 01:07