
geoplyr - dplyr style manipulation for geospatial data

eamcvey opened this issue · 15 comments

I'm somewhat new to geospatial analysis, and while the tools in R can do all kinds of things, I feel like they operate the "old" R way, not the "new" way I'm now accustomed to from using dplyr, tidyr, and friends. I think there's room for a package that could make working with geospatial data easier and more elegant -- in particular, handling sp objects more intuitively.

eamcvey avatar Mar 07 '16 15:03 eamcvey

💯 I would love a dplyr for spatial data. There are some tools in the rOpenSci suite from @sckott that we should discuss.

karthik avatar Mar 07 '16 18:03 karthik

YES! @eamcvey want to help also with a GSoC 2016 R proposal? https://github.com/rstats-gsoc/gsoc2016/wiki/spatula:-a-sane,-user-centric-(in-the-mental-model-sense)-spatial-operations-package-for-R (geoplyr sounds way cooler than spatula)

hrbrmstr avatar Mar 07 '16 18:03 hrbrmstr

FYI: Just learned about this from Roger https://github.com/edzer/sfr. Super cool (and well-written) idea!

hrbrmstr avatar Mar 09 '16 11:03 hrbrmstr

Rumour has it that the ISC proposal for sfr might get funded -- of course, the ISC still has to announce this publicly first. @karthik and @eamcvey : as I use R mostly the "old" way, could you provide some use cases or mock-ups of how you would like (new) sp classes to behave more intuitively?

edzer avatar Mar 09 '16 15:03 edzer

@hrbrmstr The GSoC proposal is a great description of what I was thinking - did this get submitted? Some other examples of doing things the "new" way would be:

  • the equivalent of bind_rows() for spatial objects (without having to manually make the IDs unique)
  • the ability to chain operations nicely
  • operations that work appropriately on the data and spatial components of a spatial object (where possible) - for example, filtering
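
As a rough sketch of what the first and third bullets could feel like, here is a plain data frame carrying a made-up "geometry" list-column (hypothetical data, base R only, no sp -- just an illustration of the desired behaviour, not an existing package API):

```r
# A plain data frame whose "geometry" column is a list of coordinate
# matrices (one matrix of x/y vertices per feature).
layer_a <- data.frame(name = c("tract_1", "tract_2"),
                      stringsAsFactors = FALSE)
layer_a$geometry <- list(cbind(x = c(0, 1, 1, 0), y = c(0, 0, 1, 1)),
                         cbind(x = c(2, 3, 3),    y = c(0, 0, 1)))

layer_b <- data.frame(name = "tract_3", stringsAsFactors = FALSE)
layer_b$geometry <- list(cbind(x = c(5, 6, 6), y = c(5, 5, 6)))

# Bullet 1: row-binding "just works" -- no manual ID de-duplication
# of the kind sp objects require.
combined <- rbind(layer_a, layer_b)

# Bullet 3: filtering the data rows carries the geometries along.
kept <- combined[combined$name != "tract_2", ]
nrow(kept)   # 2
kept$name    # "tract_1" "tract_3"
```

Because the geometry is just another column, the attribute rows and their geometries can never fall out of alignment -- which is exactly the bookkeeping that sp classes make the user do by hand.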

eamcvey avatar Mar 31 '16 20:03 eamcvey

@edzer It looks like the sfr proposal has gotten funded. I'm working on mocking up how I wish geospatial data manipulation worked in R.

eamcvey avatar Mar 31 '16 21:03 eamcvey

@eamcvey I think we're still waiting to hear back on whether @hrbrmstr's student project gets picked as well

sckott avatar Mar 31 '16 21:03 sckott

Here's the start of a mockup of how spatial analysis could look if it were possible to work with geometry columns in dataframes: https://github.com/ropenscilabs/geoplyr/blob/master/ideal_mockups.Rmd

eamcvey avatar Mar 31 '16 23:03 eamcvey

@edzer, I don't entirely understand the simple features stuff, but in my optimistic imagination it could provide the types of simple geometric objects (polygons, points, etc.) that would go into the geometry columns of dataframes, as I imagine in the mockup above. If so, I think there could be huge benefit to fitting spatial objects into columns of dataframes and then having access to the existing spectacular "new R" tools in dplyr, tidyr, and purrr to manipulate them (with special functions for operating on the geometry columns). Hadley assures me that there's no fundamental reason this isn't possible : )
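
One way to picture this: if each geometry is an element of a list-column, ordinary iteration tools -- lapply/sapply here, or purrr::map in the "new R" toolkit -- already apply to it. A minimal sketch with made-up coordinates and a naive vertex-mean "centroid" (a stand-in for a real polygon centroid computation):

```r
# Features as a data frame with a list-column of vertex matrices.
df <- data.frame(id = c("a", "b"), stringsAsFactors = FALSE)
df$geometry <- list(cbind(x = c(0, 2, 2, 0), y = c(0, 0, 2, 2)),
                    cbind(x = c(4, 6, 5),    y = c(0, 0, 3)))

# Map a function over the geometry column, feature by feature:
# colMeans gives the vertex-mean of each coordinate matrix.
centroids <- t(sapply(df$geometry, colMeans))
df$cx <- centroids[, "x"]
df$cy <- centroids[, "y"]

# Derived geometric quantities now drive ordinary data-frame verbs:
df[df$cx < 3, "id"]   # features whose centroid lies west of x = 3
```

The point of the mockup is that once geometry lives in a column, the split between "data operations" and "spatial operations" mostly dissolves: spatial functions produce columns, and the usual verbs take it from there.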

eamcvey avatar Mar 31 '16 23:03 eamcvey

My efforts in this area are in two packages, gris and spbabel:

https://github.com/mdsumner/gris

https://github.com/mdsumner/spbabel

Gris is well-developed but I'm not happy with the overall design and user-view yet. It provides a db-like "normalized" structure for spatial objects in multiple linked tables. The point is that you can more easily work on the components (vertices, pieces, objects) individually, generate other forms like edge-based or primitives-based meshes, and ultimately back-end it with a generic database.

Spbabel is simpler, and starts "in the middle" with something like the ggplot2::fortify (or raster::geom) table of vertices without enforcing uniqueness.
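
A sketch of that "middle" representation with made-up data: a fortify-style table where every vertex is a row tagged with its object, part, and path order, so ordinary table verbs apply to the vertices directly (hypothetical column names, base R only):

```r
# A fortify-style vertex table: one row per vertex, in path order,
# tagged with the part (ring/piece) and object it belongs to.
vertices <- data.frame(
  x      = c(0, 1, 1, 0,  2, 3, 3),
  y      = c(0, 0, 1, 1,  0, 0, 1),
  order  = c(1, 2, 3, 4,  1, 2, 3),
  part   = c(1, 1, 1, 1,  1, 1, 1),
  object = c(1, 1, 1, 1,  2, 2, 2)
)

# The attribute table links in by object id:
objects <- data.frame(object = 1:2, name = c("tract_1", "tract_2"),
                      stringsAsFactors = FALSE)

# Ordinary verbs now edit geometry: shift object 2 east by 10 units.
vertices$x <- vertices$x + ifelse(vertices$object == 2, 10, 0)
range(vertices$x[vertices$object == 2])   # 12 13
```

This is essentially what ggplot2::fortify (or raster::geom) already produces, minus the enforcement of uniqueness -- which is why it makes a natural meeting point between sp objects and dplyr pipelines.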

I'm trying to build it into a bigger story but these two blog posts are about as far as it goes:

http://mdsumner.github.io/2015/12/28/gis3d.html

http://mdsumner.github.io/2016/03/03/polygons-R.html

I'm very keen to explore this idea more. sp is fundamentally limiting in several ways (just as GIS is), but I'm not saying we should disown it -- I just feel we need to be able to transform between different forms much more easily.

I'm still catching up with this discussion, just wanted to drop this in :)

I also have done some work on using dplyr with ODBC, which allows me to read in from Manifold GIS directly, amongst other things. I see this all fitting together really nicely with dplyr as the new centre.

https://github.com/mdsumner/dplyrodbc

https://github.com/mdsumner/manifoldr

Cheers, Mike.

mdsumner avatar Apr 01 '16 03:04 mdsumner

@eamcvey do you have the data from your Ideal doc in concrete form? I'd like to work through your document and use it to explain how I see things. If you have the actual data and can share it, that would be awesome. This is helping me focus somewhat. :)

mdsumner avatar Apr 01 '16 05:04 mdsumner

@mdsumner You're calling my bluff -- I don't actually have that data ; ) But if it's helpful, I can get it, or something quite similar, fairly easily. If the document is helping provide focus, then it's doing its job!

eamcvey avatar Apr 01 '16 14:04 eamcvey

Thanks for the mockup, @eamcvey ! I agree with @mdsumner that we'd need some sample data with it in order to get more concrete.

For your information, sp::aggregate does aggregate polygon information, for the case of nested polygons (say, from districts to provinces) as well as non-nested polygons (assuming constant value throughout the polygons). Your last example would now look like

new_district_df <- aggregate(census_bg_df, list(new_district_df$assigned_district), sum)

which follows the stats::aggregate semantics. Pretty compact, and it dissolves polygons.
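
For readers unfamiliar with the stats::aggregate semantics this follows, here is the same call shape on a plain (non-spatial) data frame with made-up block-group numbers -- sp::aggregate additionally dissolves the polygons within each group:

```r
# Hypothetical block-group attributes (the spatial version would also
# dissolve the polygons that share a district).
census_bg <- data.frame(pop     = c(100, 250, 175, 300),
                        housing = c( 40, 110,  60, 120))
assigned_district <- c("north", "south", "north", "south")

# Same call shape as sp::aggregate(x, by, FUN):
by_district <- aggregate(census_bg, list(district = assigned_district), sum)
by_district
#   district pop housing
# 1    north 275     100
# 2    south 550     230
```

Each non-grouping column is summed within its district, and the grouping variable comes back as the first column of the result.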

Anyway, it would be great if you could for instance provide a census_bg_df shapefile to start with.

edzer avatar Apr 01 '16 19:04 edzer

@eamcvey you must be motivated by real-world data here, so of course it's helpful to have actual examples -- I don't consider it bluffing :)

Thinking about the "geometry column" idea: I think it's pretty easy to do, but what I don't like about it is that it doesn't naturally provide a topological data structure. There's no way to share vertices between objects; they all just get copied out in a recursive structure, whether as text or as a binary blob -- you might as well serialize a Polygons object, for example, and store that in a column. It's not hard, it just doesn't really help from my perspective. Topology is what is missing from sp and from most GIS implementations. Also, "Polygons" are really just lines with a fill rule, so you can't pop them out into X-Y-Z -- we really need proper surfaces that can be decomposed to triangles, with "polygons" defined by cycles in the mesh as a special case.

There's no way of avoiding the need for at least two tables: one for the vertices with identifiers for object, part, hole status, and path order, and one for the objects. I just like to take it further, so you can really "normalize" and have vertices (x, y, z, time, etc., with no limit) plus an ID. For that you need at least a vertex table, a branches (or "parts" or "pieces") table, and the objects. To normalize the vertices (store only unique rows) you need a vertex-link-branches table.

Gris does this, but it's not dplyr-able yet -- I'm working on that. Gris should offer a choice of "topology" model: it has branches (the poly-ring, line-string, point, multi-point stuff) and primitives (triangles, and/or edges for lines), and it should also have edges (line segments for polys or lines) and the ability to switch between them. The constrained Delaunay triangulation in RTriangle is so fast that I think it's worth doing all of this upfront. Then the user can go further and decompose to smaller triangles, shorter line segments, triangles with nicer angles, etc. -- but the branches, edges and primitives should always be available. Not triangulating might be a special case, but it's easy enough anyway.
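
A toy sketch of that normalized layout in base R (hypothetical table and column names): unique vertices stored once, branches pointing at objects, a link table giving each branch its vertices in path order, and ordinary joins to de-normalize:

```r
# Gris-style normalized tables: vertices stored once, referenced many times.
vertex <- data.frame(vertex_id = 1:4,
                     x = c(0, 1, 1, 0),
                     y = c(0, 0, 1, 1))
object <- data.frame(object_id = 1, name = "two_part_shape",
                     stringsAsFactors = FALSE)
branch <- data.frame(branch_id = 1:2, object_id = c(1, 1),
                     is_hole = c(FALSE, FALSE))
# Branch 2 reuses vertices 2, 3 and 4 -- shared, not copied.
link <- data.frame(branch_id  = c(1, 1, 1, 1, 2, 2, 2),
                   vertex_id  = c(1, 2, 3, 4, 2, 3, 4),
                   path_order = c(1, 2, 3, 4, 1, 2, 3))

# De-normalize back to a per-vertex path table with ordinary joins:
path <- merge(merge(link, vertex, by = "vertex_id"),
              branch, by = "branch_id")
path <- path[order(path$branch_id, path$path_order), ]
nrow(path)   # 7 path rows drawn from only 4 stored vertices
```

The vertex-link-branches table (`link` here) is what lets two branches reference the same stored vertex, which is exactly the topology that the recursive geometry-column representation gives up.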

Spbabel is dplyr-pipeable and has examples of working with the basic verbs on objects, and on the vertices using the sptable(x) <- trick (suggested by @hadley). I think the vertex-table-in-the-middle view is a better place to start than gris -- it's essentially the ggplot2 fortify table plus the linked objects. I can go from that to the more normalized, multi-table gris view, though for dplyr-abling that's probably not necessary.

mdsumner avatar Apr 02 '16 14:04 mdsumner

@edzer thanks for the aggregate example -- I actually forget sp has some of this manipulation built in. I can see we could usefully have options for group_by() %>% summarize() that unioned objects together using this. I wonder whether we need extra arguments to differentiate the summarize function(s) from the topological tasks, or whether it's best done with new verbs? I need to try this out -- I'll be able to in the next few weeks, and as ever I'm very keen to hear from anyone interested in doing this.

mdsumner avatar Apr 02 '16 14:04 mdsumner