analysis-pipelines

Outputs from the pipeline saved as RDS are significantly larger than they should be

husamrahman opened this issue on Mar 22, 2019 · 3 comments

I have an object that's created by a function. In RStudio, the object itself is around 3 GB in memory, and when I save it as an RDS the file is ~800 MB. But when I call the same function as part of the pipeline, extract that specific object from the pipeline output, and save it, the RDS is 5 GB+. It seems like some kind of dependency is being saved along with the object. Do I need to do something specific to remove the additional metadata?

Walkthrough:

x <- Result_From_Some_Function()
saveRDS(x)   # ~800 MB

x <- output_from_pipeline@result$f1
saveRDS(x)   # ~5 GB+

husamrahman · Mar 22 '19 21:03
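For reference, `object.size()`, `serialize()`, and `file.size()` measure three different things, and comparing them for both objects, after confirming with `identical()` or `all.equal()` that the two objects really are the same, narrows down where the extra bytes come from. A self-contained sketch, with a throwaway data frame standing in for the real result:

```r
# Stand-in for the real result object; substitute either object from the walkthrough.
x <- data.frame(a = rnorm(1e6), b = rnorm(1e6))

print(object.size(x), units = "MB")   # size R reports for the object in memory
length(serialize(x, NULL)) / 1024^2   # exact uncompressed size it serializes to, in MB

f <- tempfile(fileext = ".rds")
saveRDS(x, f)                         # saveRDS gzip-compresses by default
file.size(f) / 1024^2                 # what actually lands on disk, in MB
```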

This should not be the case. Within the R session, the object stored as pipeline output is exactly the same as the object returned by a direct call to the function. There might be minor differences when saving to RDS, but definitely not of the order you describe.

I verified this with a simple example of a plot on the iris data, using object.size() to check the size of the object in memory.

Could you share a reproducible example where you are facing this issue? Does this happen for the outputs of all functions, or only this specific function? And what is the class of the object returned by the function where you are facing this issue?

naren1991 · Mar 23 '19 04:03
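Roughly what that check looks like, assuming ggplot2 for the plot object; both numbers come out small here:

```r
library(ggplot2)

p <- ggplot(iris, aes(Sepal.Length, Sepal.Width, colour = Species)) +
  geom_point()

print(object.size(p), units = "Kb")   # in-memory size of the plot object

f <- tempfile(fileext = ".rds")
saveRDS(p, f)
file.size(f) / 1024                   # saved .rds size in Kb
```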

The specific object being generated is a random forest model built using the randomForestSRC package. I have tested this multiple times and can reproduce it every time. Are there any potential conflicts with this package?

husamrahman · Mar 25 '19 22:03
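A direct-call baseline for this package is easy to script; the pipeline half of a reproducible example would still need the pipeline configuration from the report. A sketch, assuming randomForestSRC is installed, with a small ntree so it runs quickly:

```r
library(randomForestSRC)

fit <- rfsrc(Species ~ ., data = iris, ntree = 50)
class(fit)                             # the class asked about above

print(object.size(fit), units = "MB")  # in-memory size of the model object
f <- tempfile(fileext = ".rds")
saveRDS(fit, f)
file.size(f) / 1024^2                  # .rds size in MB for the direct call
```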

This helps. There are no known conflicts, but it might be a problem specific to this object. I will investigate further.

naren1991 · Apr 01 '19 14:04
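One mechanism in R that produces exactly this pattern, although the thread does not confirm it is the cause here, is an environment captured by a formula or terms object created inside a function: everything living in that environment gets serialized into the .rds alongside the model. A minimal, self-contained illustration with `lm()`:

```r
make_model <- function() {
  big_junk <- rnorm(5e6)                               # ~40 MB unrelated to the model
  lm(Sepal.Length ~ Sepal.Width, data = iris)          # formula created inside the function
}

m_top  <- lm(Sepal.Length ~ Sepal.Width, data = iris)  # formula created at top level
m_wrap <- make_model()                                 # same model, built inside a function

length(serialize(m_top,  NULL)) / 1024^2   # well under 1 MB
length(serialize(m_wrap, NULL)) / 1024^2   # ~40 MB: the captured environment rides along

ls(environment(m_wrap$terms))              # "big_junk" lives in the captured environment
```

If something like this turns out to be the culprit for the rfsrc object, resetting the captured environment(s) to an empty one before saving, and checking that prediction still works afterwards, should bring the file size back down.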