fuzzyjoin
Error: vector memory exhausted (limit reached?)
I’m getting the above error when trying to stringdist_left_join two tables: the left table has 185K rows and the right table has 4.37M rows. The R session never appears to use more than 6GB of memory (according to Activity Monitor), even though the machine has 32GB of memory and roughly 10GB is still available when the vector memory exhausted error arises. I’ve followed various recommendations to increase R_MAX_VSIZE to a large number (700GB, as shown in the Sys.getenv() output below). All this to say, it appears that stringdist_left_join does not pay attention to R_MAX_VSIZE. Is there some other setting I can change so it can use more of the available memory on my machine?
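For context, a minimal sketch of the kind of call being described (not the original code; the table names left_tbl/right_tbl, the join column "name", and the max_dist value are placeholders):

library(fuzzyjoin)

# Fuzzy left join on a string column; with ~185K rows on the left and
# ~4.37M rows on the right, the pairwise string comparisons between the
# two tables drive the memory use, not the size of either table alone.
joined <- stringdist_left_join(
  left_tbl, right_tbl,
  by = "name",     # placeholder join column
  max_dist = 2     # placeholder edit-distance threshold
)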
Sys.getenv()
Apple_PubSub_Socket_Render /private/tmp/com.apple.launchd.sSrL33I64Z/Render
COLUMNS 80
COMMAND_MODE unix2003
DISPLAY /private/tmp/com.apple.launchd.tTt2eLd6xQ/org.macosforge.xquartz:0
DYLD_FALLBACK_LIBRARY_PATH /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home/jre/lib/server
DYLD_LIBRARY_PATH /Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home/jre/lib/server
EDITOR vi
HOME /Users/geoffreysnyder
LD_LIBRARY_PATH :@JAVA_LD@
LINES 24
LN_S ln -s
LOGNAME geoffreysnyder
MAKE make
PAGER /usr/bin/less
PATH /usr/local/bin:/usr/local/mysql/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:~/Library/Python/3.7/bin
PWD /Users/geoffreysnyder/repos/Data_Load/code
R_ARCH
R_BROWSER /usr/bin/open
R_BZIPCMD /usr/bin/bzip2
R_DOC_DIR /Library/Frameworks/R.framework/Resources/doc
R_GZIPCMD /usr/bin/gzip
R_HOME /Library/Frameworks/R.framework/Resources
R_INCLUDE_DIR /Library/Frameworks/R.framework/Resources/include
R_LIBS_SITE
R_LIBS_USER ~/Library/R/3.5/library
R_MAX_VSIZE 700GB
R_PAPERSIZE a4
R_PDFVIEWER /usr/bin/open
R_PLATFORM x86_64-apple-darwin15.6.0
R_PRINTCMD lpr
R_QPDF /Library/Frameworks/R.framework/Resources/bin/qpdf
R_RD4PDF times,inconsolata,hyper
R_SESSION_TMPDIR /var/folders/xw/402kc2hc8xl82d008k8x64f00000gq/T//RtmpJdct7Y
R_SHARE_DIR /Library/Frameworks/R.framework/Resources/share
R_SYSTEM_ABI osx,gcc,gxx,gfortran,?
R_TEXI2DVICMD /usr/local/bin/texi2dvi
R_UNZIPCMD /usr/bin/unzip
R_ZIPCMD /usr/bin/zip
SECURITYSESSIONID 186a8
SED /usr/bin/sed
SHELL /bin/zsh
SHLVL 0
SSH_AUTH_SOCK /private/tmp/com.apple.launchd.UNOOV1wxev/Listeners
SUBLIMEREPL_AC_IP 127.0.0.1
SUBLIMEREPL_AC_PORT None
TAR /usr/bin/tar
TMPDIR /var/folders/xw/402kc2hc8xl82d008k8x64f00000gq/T/
TZ America/Los_Angeles
USER geoffreysnyder
XPC_FLAGS 0x0
XPC_SERVICE_NAME 0
__CF_USER_TEXT_ENCODING 0x1F7:0x0:0x0
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14.2
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 RJDBC_0.2-7.1 rJava_0.9-10 DBI_1.0.0 fuzzyjoin_0.1.4 readr_1.2.0 dplyr_0.7.8
[8] lubridate_1.7.4 stringr_1.3.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 tidyr_0.8.2 assertthat_0.2.0 R6_2.3.0 magrittr_1.5 pillar_1.2.3
[7] rlang_0.3.0.1 stringi_1.2.4 tools_3.5.1 glue_1.3.0 purrr_0.2.5 hms_0.4.2.9000
[13] compiler_3.5.1 pkgconfig_2.0.2 bindr_0.1.1 tidyselect_0.2.5 tibble_1.4.2
An observation from my experience: I was doing a fuzzy join and ran out of RAM, but the largest dataframe was only 200,000 rows. I subsetted the two dataframes by a common identifier, looped across the list of identifiers, and did the fuzzy join for each subset; this worked very quickly. Maybe someone could check the efficiency of the code on larger examples? I'm assuming making a reprex for big-data examples is a hassle.
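A hedged sketch of the subset-and-loop approach described above, assuming both tables share an exact-match identifier column; the names left_tbl, right_tbl, block_id, and the join column "name" are hypothetical:

library(dplyr)
library(fuzzyjoin)

# Block on a shared exact-match identifier, fuzzy-join within each block,
# then bind the per-block results together. Each block's pairwise
# comparison stays small instead of comparing every left row to every
# right row. Note: left rows whose identifier never appears in right_tbl
# are not covered by this sketch.
ids <- intersect(unique(left_tbl$block_id), unique(right_tbl$block_id))

pieces <- lapply(ids, function(id) {
  stringdist_left_join(
    filter(left_tbl, block_id == id),
    filter(right_tbl, block_id == id),
    by = "name",
    max_dist = 2
  )
})
result <- bind_rows(pieces)

The blocking column has to match exactly on both sides, which is what makes it safe to filter on before the fuzzy comparison.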
Similar to markbneal above, I was doing my first fuzzy join and ran into a vector memory exhausted error. I was doing it through a purrr::map step, joining a dataframe of about 50,000 rows onto individual rows of a dataframe of 5,000 rows. My solution was to rewrite it as a for loop.
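A minimal sketch of that rewrite, under assumed names (small_tbl for the 5,000-row table, big_tbl for the 50,000-row table, joining on a hypothetical column "name"):

library(fuzzyjoin)

# Join the 50,000-row table onto one row of the 5,000-row table per
# iteration, collecting results in a pre-allocated list, then bind at the end.
results <- vector("list", nrow(small_tbl))
for (i in seq_len(nrow(small_tbl))) {
  results[[i]] <- stringdist_left_join(
    small_tbl[i, , drop = FALSE],  # one left row per iteration
    big_tbl,
    by = "name",
    max_dist = 2
  )
}
out <- dplyr::bind_rows(results)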
Very similar here: I was doing a fuzzy_join of a 43MB file to a 68KB one, and at its peak R used 12GB of RAM (almost 300 times more than the individual objects!).