
unusually high memory footprint

Open · cschwem2er opened this issue 4 years ago • 3 comments

Hi,

I'm using the latest CRAN version of spacyr and its spaCy installer. For a dataset of 300k documents (each roughly email length, about 3 GB uncompressed) I am using spacyr for lemmatization. The result is an insanely high memory footprint:

[screenshot of memory usage omitted]

Unfortunately I can't share the dataset for reproduction, but if I can help figure out what's going on in any other way, please let me know :)

cschwem2er · Dec 06 '20

I've noticed similar patterns too. It would also be very interesting to compare this to the memory usage when parsing the same texts with spaCy directly in Python. @amatsuo, want to run some tests? I wonder whether this is spaCy or reticulate.

@cschwem2er maybe batching the parsing would solve this?
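
A minimal sketch of what batching could look like (an illustration, not code from the thread), assuming `texts` is a named character vector of documents and that a chunk size of 10,000 fits comfortably in RAM:

```r
# Batching sketch: parse the corpus in chunks so spaCy never holds all
# 300k documents at once. `texts` and the chunk size are placeholders;
# a named vector keeps doc_ids unique across chunks.
library(spacyr)
spacy_initialize()

chunk_size <- 10000
chunks <- split(texts, ceiling(seq_along(texts) / chunk_size))

parsed <- lapply(chunks, function(x) {
  out <- spacy_parse(x, lemma = TRUE, pos = FALSE, entity = FALSE)
  gc()  # encourage R to release memory between chunks
  out
})
parsed <- do.call(rbind, parsed)

spacy_finalize()
```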

kbenoit · Dec 06 '20

Thanks, Ken, for the fast response. Yes, I used batching as a workaround before and it did the trick (so does buying more RAM =D).

cschwem2er · Dec 06 '20

This issue may be due to multithread = TRUE. I've used spacyr a lot over the past couple of months, and whenever multithread = TRUE (regardless of the corpus), memory usage increases drastically compared with multithread = FALSE. The built-in multithreading also doesn't spawn any additional processes (none are detectable via top on a Linux machine).

I'm fairly certain this is somehow related to #206 and that multithreading is not working as intended, eating up massive amounts of RAM instead of parallelizing the calculations...
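
For reference, the setting in question is the multithread argument of spacy_parse(). A minimal illustration of the two configurations described above (`texts` stands in for any character vector of documents):

```r
library(spacyr)
spacy_initialize()

# `texts` is a placeholder corpus; only the multithread flag differs.
parsed_mt  <- spacy_parse(texts, lemma = TRUE, multithread = TRUE)   # reported: drastic memory growth
parsed_seq <- spacy_parse(texts, lemma = TRUE, multithread = FALSE)  # reported: far lower memory use
```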

My setup is Fedora 34 running on an AMD Ryzen 7 3700X, using spacyr_1.2.1. I'm happy to supply smaller and larger corpora to test this, but I believe this is a spacyr issue, not a data issue. A good (not too large) testing corpus might be this one (it's in German, though): https://doi.org/10.5281/zenodo.3902658

I did succeed in building a parallelized workaround by setting multithread = FALSE and adding a doParallel/foreach layer on top: https://github.com/SeanFobbe/R-fobbe-proto-package/blob/main/f.dopar.spacyparse.R The same approach with a future frontend/backend fails because of non-exportable objects; I'm not sure why this doesn't affect the doParallel approach.
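
A rough sketch of that kind of doParallel/foreach wrapper (an illustration, not the linked implementation; the worker count, chunk split, and model name are assumptions). Each worker initializes its own spaCy session, since the reticulate handle is a non-exportable object:

```r
library(doParallel)
library(foreach)

n_workers <- 4                       # assumption: adjust to your machine
cl <- makeCluster(n_workers)
registerDoParallel(cl)

# Split the corpus into one chunk per worker; `texts` is a placeholder
# named character vector (names keep doc_ids unique across chunks).
chunks <- split(texts, cut(seq_along(texts), n_workers, labels = FALSE))

parsed <- foreach(chunk = chunks, .combine = rbind, .packages = "spacyr") %dopar% {
  # Each worker starts its own spaCy/reticulate session and parses
  # sequentially (multithread = FALSE).
  spacy_initialize(model = "de_core_news_sm")  # assumption: German model for the Zenodo corpus
  out <- spacy_parse(chunk, lemma = TRUE, multithread = FALSE)
  spacy_finalize()
  out
}

stopCluster(cl)
```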

SeanFobbe · Jan 25 '22

We are aware of these issues and are (finally!) getting around to addressing them in #185. spaCy itself has also improved in this regard. Hope to have solutions soon.

kbenoit · Sep 01 '22