spacyr
unusually high memory footprint
Hi,
I'm using the latest CRAN version of spacyr with the spacy installer. For a dataset of 300k documents (roughly email length, about 3 GB uncompressed) I am using spacyr for lemmatization. The result is an extremely high memory footprint.
Unfortunately I can't share the dataset for reproduction, but if I can help to find out what's going on in any other way please let me know :)
I've noticed similar patterns too. It would also be very interesting to compare this to the memory usage when parsing these texts in spaCy in Python. @amatsuo want to run some tests? I wonder if this is spaCy or whether it's reticulate.
@cschwem2er maybe batching the parsing would solve this?
Thanks Ken for the fast response and yes, I used batching as a workaround before and it did the trick (so does buying more RAM =D).
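For reference, the batching workaround can be sketched roughly like this (the batch size, model name, and sample texts are assumptions; tune the batch size to available RAM):

```r
library(spacyr)

# Hypothetical input: a named character vector of documents
texts <- c(doc1 = "This is a test.", doc2 = "Another document.")

spacy_initialize(model = "en_core_web_sm")  # model name is an assumption

batch_size <- 10000  # assumed value; tune to available RAM
batches <- split(texts, ceiling(seq_along(texts) / batch_size))

# Parse each batch separately and bind the results, so only one
# batch's worth of intermediate objects is held in memory at a time
parsed <- do.call(rbind, lapply(batches, function(b) {
  spacy_parse(b, lemma = TRUE, pos = FALSE, entity = FALSE)
}))

spacy_finalize()
```

Because `spacy_parse()` returns a plain data frame per batch, the `rbind` at the end reconstructs the same token-level output a single call would produce.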
This issue may be due to `multithread = TRUE`. I've used spacyr a lot over the past couple of months, and whenever `multithread = TRUE` is set (regardless of corpus), memory usage increases drastically compared to `multithread = FALSE`. The built-in multithreading also doesn't spawn any additional processes (which would be detectable via `top` on a Linux machine).
I'm fairly certain this is somehow related to #206 and multithreading is not working as intended, eating up massive amounts of RAM instead of parallelizing the calculations...
My setup is Fedora 34, running on an AMD Ryzen 7 3700X, using spacyr_1.2.1. I'm happy to supply smaller and larger corpora to test this, but I believe this is a spacyr issue, not a data issue. A good testing corpus (not too large) might be this one (is in German, though): https://doi.org/10.5281/zenodo.3902658
I did succeed in building a parallelized workaround by setting `multithread = FALSE` and adding a doParallel/foreach framework on top: https://github.com/SeanFobbe/R-fobbe-proto-package/blob/main/f.dopar.spacyparse.R The same approach with a future front/backend fails because of non-exportable objects; I'm not sure why this doesn't affect the doParallel approach.
We are aware of these issues and are (finally!) getting around to addressing them in #185. spaCy has also improved in this regard too. Hope to have solutions soon.