Error when working with big dataset
Hi, and thank you for your work on the coder package. I ran into issues while applying the `categorize()` function to a fairly large data frame (~4 GB). The function returns the following error message:
```
Error in copybig(x, .copy) :
  Object is > 1 GB. Set argument 'copy' to TRUE' or FALSE to declare wether
  it should be copied or changed by reference!
```
But there seems to be no way (judging from the documentation) to actually set the copy argument. I've tried adding either `copy = TRUE` or `.copy = TRUE` to my calls to `categorize()`, in both cases without effect. Is there another way to address the issue?
Dear @dtgnn!
I am very happy to hear that you are using the package! And thank you very much! I realize the documentation here should be improved!
It is a little complex, since the `.copy` argument is used by the `coder::copybig()` function but passed from `coder::categorize()` via `coder::codify()` before it gets there. Hence, it is not documented in `?categorize` but in `?copybig` and `?codify`. Anyway, arguments passed from `coder::categorize()` to `coder::codify()` must be wrapped in a list, such as `categorize(..., codify_args = list(.copy = <TRUE/FALSE>))`. This is because `categorize()` can pass arguments both between its methods and on to `codify()` and `set_classcodes()` (if `x` is of class `data.table`).
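To make this concrete, here is a minimal sketch using the example data bundled with the package (`ex_people` and `ex_icd10`); substitute your own data and arguments as needed:

```r
library(coder)

# Minimal sketch with the package's bundled example data; the essential
# part is that .copy is wrapped inside codify_args, not passed directly:
categorize(
  ex_people,
  codedata    = ex_icd10,
  cc          = "charlson",
  id          = "name",
  code        = "icd10",
  codify_args = list(.copy = FALSE)  # forwarded via codify() to copybig()
)
```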
Please let me know if this works or not!?
Thank you for your message, @eribul. With your input, I now see that the `.copy` argument is indeed listed in the `codify()` help page... my bad! I'll try to amend my code and report back with the results.
Hello @eribul,
Just a quick update to say that I tried passing both options to `codify()`, but neither seemed to handle my large data frame well.

Using `categorize(..., codify_args = list(.copy = FALSE))` produced the following errors:

```
Error: cannot allocate vector of size 1.3 Gb
Error during wrapup: cannot allocate vector of size 1.4 Gb
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
```

Using `categorize(..., codify_args = list(.copy = TRUE))` caused the R session to eat up all of my available memory (>100 GB); I interrupted the process to keep the session from crashing.
I have resorted to slicing my dataset and iterating over the slices (roughly as sketched below). It seems to do the job.
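For the record, a rough sketch of that workaround, with hypothetical objects `patients` and `codes` standing in for my actual data (both keyed by a "name" id column):

```r
library(coder)

# Split the ids into slices of ~10,000 units each; every slice stays well
# below the 1 GB threshold checked by copybig(), so the prompt never fires.
ids    <- unique(patients$name)
slices <- split(ids, ceiling(seq_along(ids) / 10000))

result <- do.call(rbind, lapply(slices, function(i) {
  categorize(
    patients[patients$name %in% i, ],
    codedata = codes[codes$name %in% i, ],
    cc       = "charlson",
    id       = "name",
    code     = "icd10"
  )
}))
```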
Thank you again for your help!
I am sorry to hear that!
Is it possible, however, that you might be running a 32-bit version of R? If so, I would suspect that the 1.4 Gb limit is caused by that rather than by your actual RAM. If you are unsure, you can type `R.version$arch` in the console to find out (it is also stated on the third line of the start-up message when R launches). If possible, I would suggest using a 64-bit version of R.
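For example, on a 64-bit build:

```r
R.version$arch
#> [1] "x86_64"   # a 64-bit build; "i386" would indicate a 32-bit one
```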
And just to rule out the obvious: the > 100 GB is your RAM (not your disk space), right? :-)
The R version is x86_64, and the 100 GB is RAM.