
R crashes when doing a difference_left_join on larger datasets

Open pabuta opened this issue 6 years ago • 2 comments

I have the following setting:

DatasetA: approximately 20 thousand rows, coordinates are given as latA, lonA
DatasetB: approximately 2 million rows, coordinates are given as latB, lonB

Because the coordinates do not exactly match, I tried the following:

DatasetC <- DatasetA %>% difference_left_join(DatasetB, by = c("latA" = "latB", "lonA" = "lonB"), max_dist = 2)

This works when I take a sample (e.g. 10%) from DatasetA, but R repeatedly crashes when I use the entire dataset. Has anyone experienced similar behaviour?

pabuta, Apr 20 '18 13:04

PostGIS or a GEOS-based R package such as spdep or sf would be far more efficient for this kind of operation.
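
For illustration, a minimal sketch of what the sf route might look like. The toy data frames stand in for DatasetA and DatasetB and are made up for this example. Note that this is an assumption-laden substitute rather than an exact equivalent: `st_is_within_distance` matches points within a straight-line distance of 2 (a circle), whereas `difference_left_join` with `max_dist = 2` matches when each coordinate differs by at most 2 (a box). No CRS is set, so distances are planar and in coordinate units.

```r
library(sf)

# Toy stand-ins for the real tables (hypothetical values).
DatasetA <- data.frame(latA = c(0, 10), lonA = c(0, 10))
DatasetB <- data.frame(latB = c(1, 0, 10), lonB = c(1, 3, 11))

# Turn both tables into point layers, keeping the original coordinate columns.
a_sf <- st_as_sf(DatasetA, coords = c("lonA", "latA"), remove = FALSE)
b_sf <- st_as_sf(DatasetB, coords = c("lonB", "latB"), remove = FALSE)

# Left spatial join: every row of a_sf, paired with each b_sf point within
# distance 2. Extra arguments such as `dist` are passed on to the join predicate.
DatasetC <- st_join(a_sf, b_sf, join = st_is_within_distance, dist = 2)
```

If real-world distances matter (metres rather than degrees), the points would need a CRS and a distance in the matching units.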

dylanbeaudette, Apr 20 '18 16:04

If you like the fuzzy join approach, you may be able to subset your data frames by region, then fuzzy join within regions, then join up the resultant data frames.

https://github.com/dgrtwo/fuzzyjoin/issues/51#issuecomment-543483010
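
A rough sketch of the region idea, for anyone who wants to try it. The crash is probably a memory problem: with about 20,000 × 2,000,000 = 4 × 10^10 candidate pairs, comparing everything against everything is unlikely to fit in RAM. The helper below, `blocked_difference_left_join`, is hypothetical (it is not part of fuzzyjoin) and the `cell` size is an arbitrary choice. It splits DatasetA into grid cells and, for each cell, joins only against the rows of DatasetB that lie within the cell's bounding box padded by `max_dist`, so matches across cell borders are not lost.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical helper, not part of fuzzyjoin. Assumes the column names from
# the question (latA/lonA, latB/lonB) and no shared column names between a and b.
blocked_difference_left_join <- function(a, b, max_dist = 2, cell = 10) {
  # Label each row of `a` with the grid cell (cell x cell degrees) it falls in.
  cells <- paste(floor(a$latA / cell), floor(a$lonA / cell))

  pieces <- lapply(split(a, cells), function(a_part) {
    # Only rows of `b` inside the padded bounding box can match this piece.
    b_part <- b %>%
      filter(
        latB >= min(a_part$latA) - max_dist,
        latB <= max(a_part$latA) + max_dist,
        lonB >= min(a_part$lonA) - max_dist,
        lonB <= max(a_part$lonA) + max_dist
      )

    # No candidates at all: keep the rows of `a` unmatched.
    # bind_rows() below fills the missing `b` columns with NA.
    if (nrow(b_part) == 0) return(a_part)

    difference_left_join(
      a_part, b_part,
      by = c("latA" = "latB", "lonA" = "lonB"),
      max_dist = max_dist
    )
  })

  bind_rows(pieces)
}

# Usage on small simulated data (made-up values).
set.seed(1)
a <- data.frame(latA = runif(50, 0, 40), lonA = runif(50, 0, 40))
b <- data.frame(latB = runif(200, 0, 40), lonB = runif(200, 0, 40))
blocked <- blocked_difference_left_join(a, b, max_dist = 2, cell = 10)
```

Row order in the result follows the cells rather than the original order of DatasetA. Smaller cells mean less memory per join but more joins, so the cell size is something to tune against the data.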

markbneal, Oct 18 '19 04:10