fuzzyjoin
R crashes when doing a difference_left_join on larger datasets
I have the following setting:
DatasetA: approximately 20 thousand rows, coordinates are given as latA, lonA
DatasetB: approximately 2 million rows, coordinates are given as latB, lonB
Because the coordinates do not exactly match, I tried the following:
DatasetC <- DatasetA %>% difference_left_join(DatasetB, by = c("latA" = "latB", "lonA" = "lonB"), max_dist = 2)
This works when I take a sample (e.g. 10%) from DatasetA but repeatedly crashes when using the entire dataset. Did you experience similar behaviour?
PostGIS or a GEOS-based R package such as spdep or sf would be far more efficient for this kind of operation.
If you like the fuzzy join approach, you may be able to subset your data frames by region, then fuzzy join within regions, then join up the resultant data frames.
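A rough sketch of that region-by-region idea, in case it helps. It assumes the column names from the question and no missing coordinates. The helper name `banded_difference_left_join`, the `band_width` argument, and the toy data are made up for illustration and are not part of fuzzyjoin. Each latitude band of DatasetA is joined only against the rows of DatasetB that lie within max_dist of that band, so every individual join compares far fewer pairs than the full join does.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical helper (not part of fuzzyjoin): join one latitude band at a time.
banded_difference_left_join <- function(a, b, max_dist, band_width = 1) {
  # Group the rows of `a` into latitude bands of width `band_width`.
  bands <- split(a, floor(a$latA / band_width))

  pieces <- lapply(bands, function(a_band) {
    # A row of `b` can only match this band if its latitude lies within
    # max_dist of the band's latitude range, so drop everything else first.
    b_near <- b %>%
      filter(latB >= min(a_band$latA) - max_dist,
             latB <= max(a_band$latA) + max_dist)

    difference_left_join(a_band, b_near,
                         by = c("latA" = "latB", "lonA" = "lonB"),
                         max_dist = max_dist)
  })

  # Stack the per-band results back into a single data frame.
  bind_rows(pieces)
}

# Toy data, for illustration only.
toyA <- data.frame(latA = c(10, 10.5, 40), lonA = c(20, 20, 50))
toyB <- data.frame(latB = c(10.1, 41, 80), lonB = c(20.2, 51, 0))

toyC <- banded_difference_left_join(toyA, toyB, max_dist = 2)
```

On the real data the call would be something like `DatasetC <- banded_difference_left_join(DatasetA, DatasetB, max_dist = 2)`. The rows come back grouped by band rather than in the original order, and I have not checked how the join behaves when a band has no nearby rows in DatasetB at all, so that case may need separate handling. Narrower bands keep each join smaller at the cost of more joins.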
https://github.com/dgrtwo/fuzzyjoin/issues/51#issuecomment-543483010