
Error when working with big dataset

Open dtgnn opened this issue 3 years ago • 5 comments

Hi and thank you for your work on the coder package. I ran into issues while applying the categorize function to a fairly large dataframe (~4GB). The function returns the following error message:

Error in copybig(x, .copy) : 
  Object is > 1 GB. Set argument 'copy' to TRUE' or FALSE to declare wether it should be copied or changed by reference!

But there seems to be no way (judging from the documentation) to actually set the copy argument. I've tried including either copy = TRUE or .copy = TRUE in my calls to categorize(), in both cases without effect. Is there another way to address the issue?

dtgnn avatar Dec 07 '21 00:12 dtgnn

Dear @dtgnn!

I am very happy to hear that you are using the package, and thank you very much! I realize the documentation here should be improved!

It is a little complicated: the .copy argument belongs to coder::copybig(), but it is passed from coder::categorize() via coder::codify() before it gets there. Hence it is documented in ?copybig and ?codify, not in ?categorize. Arguments passed from coder::categorize() to coder::codify() must be wrapped in a list: categorize(..., codify_args = list(.copy = <TRUE/FALSE>)). This is because categorize() may pass arguments both between its own methods and on to codify() and set_classcodes() (if x is of class data.table).
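A minimal sketch of the call shape described above. The data objects and most parameter names here are hypothetical placeholders (see ?categorize and ?codify for the exact signatures); the point is only that .copy is wrapped inside codify_args rather than passed directly:

```r
library(coder)

# Hypothetical inputs; replace with your own objects.
# x        : large data frame of individuals
# my_codes : coded events (e.g. ICD-10 codes) for those individuals
result <- categorize(
  x,
  codedata    = my_codes,
  cc          = "charlson",           # classcodes object (example value)
  codify_args = list(.copy = FALSE)   # forwarded to codify(), then copybig()
)
```

With .copy = FALSE the large object is modified by reference instead of being duplicated, which is the memory-saving option for big inputs; .copy = TRUE makes an explicit copy first.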

Please let me know whether this works!

eribul avatar Dec 07 '21 20:12 eribul

Thank you for your message, @eribul. With your input I now see that the .copy argument is indeed listed in the codify() help page... my bad! I'll try to amend my code and report back with the results.

dtgnn avatar Dec 07 '21 21:12 dtgnn

Hello @eribul,

Just a quick update to say that I tried passing both options to codify(), but neither seemed to handle my large dataframe well.

Using categorize(..., codify_args = list(.copy = FALSE)) produced the following error:

Error: cannot allocate vector of size 1.3 Gb
Error during wrapup: cannot allocate vector of size 1.4 Gb
Error: no more error handlers available (recursive errors?); invoking 'abort' restart

Using categorize(..., codify_args = list(.copy = TRUE)) caused the R session to consume all my available memory (>100 GB); I interrupted the process to avoid crashing the session.

I have resorted to slicing my dataset and iterating over the slices. That seems to do the job.
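For reference, the slicing workaround can be sketched like this (a rough outline, not the exact code used above; the categorize() arguments are stand-ins for whatever the full-size call would use):

```r
# Split the large data frame x into n_chunks row-wise slices,
# run categorize() on each slice, then bind the results back together.
n_chunks <- 10
idx      <- cut(seq_len(nrow(x)), breaks = n_chunks, labels = FALSE)

pieces <- lapply(split(x, idx), function(chunk) {
  categorize(chunk, codedata = my_codes, cc = "charlson")  # same args as full call
})

result <- do.call(rbind, pieces)
```

One caveat: splitting by row position is only safe if all records belonging to one individual land in the same chunk. Splitting on the id column instead (e.g. grouping ids into n_chunks buckets before split()) avoids cutting an individual's coded history across chunks.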

Thank you again for your help!

dtgnn avatar Dec 08 '21 22:12 dtgnn

I am sorry to hear that!

Is it possible, however, that you are running a 32-bit version of R? If so, the 1.4 Gb limit might be caused by that rather than by your actual RAM. If you are unsure, you can type R.version$arch in the console to find out. (It is also stated on the third line of the start-up message when R starts.) If possible, I would suggest using a 64-bit version of R.

And just to rule out the obvious; the > 100 GB is your RAM (not your disk memory) right? :-)

eribul avatar Dec 09 '21 20:12 eribul

R version x86_64. 100GB of RAM.

dtgnn avatar Dec 10 '21 05:12 dtgnn