simstudy icon indicating copy to clipboard operation
simstudy copied to clipboard

pass-by-value vs. pass-by-reference

Open assignUser opened this issue 3 years ago • 14 comments

I was just looking through the code and I noticed something changed that probably shouldn't have. Whenever we pass a data.table as an argument to a function, it is imperative to copy the data table to another data.table. So you might see something like:

foo <- function(dtName, ...) {
   dT <- copy(dtName)
   .
   .
}

This addresses a weird R problem (and maybe there is a better work around, but this was the only way I could figure out) where the original data.table gets changed by the function call even if that was not the intention. For example:

library(data.table)

dd <- data.table(x = c(1, 2, 3), y = c(3, 4, 5))
dd
#>    x y
#> 1: 1 3
#> 2: 2 4
#> 3: 3 5

# doesn't do what you think it does

chk <- function(d) {
  d[, z := c(9, 10, 11)]
  d[]
}

chk(dd)
#>    x y  z
#> 1: 1 3  9
#> 2: 2 4 10
#> 3: 3 5 11
dd
#>    x y  z
#> 1: 1 3  9
#> 2: 2 4 10
#> 3: 3 5 11

# this does

dd <- data.table(x = c(1, 2, 3), y = c(3, 4, 5))
dd
#>    x y
#> 1: 1 3
#> 2: 2 4
#> 3: 3 5

chk2 <- function(d) {
  d2 <- copy(d)
  d2[, z := c(9, 10, 11)]
  d2[]
}

chk2(dd)
#>    x y  z
#> 1: 1 3  9
#> 2: 2 4 10
#> 3: 3 5 11
dd
#>    x y
#> 1: 1 3
#> 2: 2 4
#> 3: 3 5

Originally posted by @kgoldfeld in https://github.com/kgoldfeld/simstudy/pull/49#issuecomment-698380875

assignUser avatar Sep 24 '20 14:09 assignUser

@kgoldfeld That is intended behavior by data.table as I understand. The terminology is pass-by-value vs. pass-by-reference. R almost exclusively works via pass-by-value, which means that arguments to functions are copied when they are passed on. This avoids side-effects like you describe here but also lowers performance due to the copying and increased memory usage. See this Stackoverflow answer for more general information.

As data.table is specifically made for huge data sets they implement changes via reference (e.g. reordering columns etc.) because they don't have to copy stuff around.

I did not remove calls to dt <- copy(dt) as far as I know. Do you want to do this more consistently throughout simstudy?

assignUser avatar Sep 24 '20 14:09 assignUser

I did see in genOrdCat that you removed dt <- copy(dtName) and you continue to work with dtName throughout, not dt (since it does not exist anymore). Was that your intention?

kgoldfeld avatar Sep 24 '20 15:09 kgoldfeld

I can blame you for that 😝 because I just reused the body of genCorOrdCat where you don't seem to do the copy thing: https://github.com/kgoldfeld/simstudy/blob/84d995dc0684d7a790633bccf81dad480915ba85/R/generate_correlated_data.R#L622

But what ever the case let us double check all functions and make it the default to copy the incoming data.table. We could add an option (e.g. environment variable) to deactivate that behavior for people who work with very large simulations. That way we could preserve the speed gains but make it a conscious choice.

assignUser avatar Sep 24 '20 15:09 assignUser

You are so right - I just looked at the release version. Not sure how I made that mistake, but I did. And then I accused you of it - doubly bad!

I'm not sure it is necessary to deactivate that behavior - since I generally test all the functions generating a million observations to see if things get too bogged down. Just seems like that would add complexity for the user. Now, if people started complaining, then I guess we could do it.

kgoldfeld avatar Sep 24 '20 15:09 kgoldfeld

I had a quick look and at the moment many but not all functions copy their input data.table. I would argue for creating one consistent behavior across all simstudy functions, either with or without copying. Consistency makes for a good user experience 😸

assignUser avatar Sep 24 '20 15:09 assignUser

I totally agree - we should copy in all cases where it is possible the user will be creating a new data.table as a result of the call to the function are their intention is to change the original data set.

kgoldfeld avatar Sep 24 '20 15:09 kgoldfeld

ok I will look into it

assignUser avatar Sep 24 '20 16:09 assignUser

I will think about this some more. For now lets post pone it after 0.2.0.

assignUser avatar Oct 05 '20 16:10 assignUser

I agree - probably not causing too many problems. But, I am sure we will hear about it if it is.

kgoldfeld avatar Oct 05 '20 16:10 kgoldfeld

I restored the old behaviour of genOrdCat in the trtAssign pull request :) after that is merged I will start the rhub checks.

assignUser avatar Oct 05 '20 16:10 assignUser

I still think we should default to one behavior for all function but maybe it would make sense to add a parameter option to turn the copying off/on. I think we should adress this after #79 so we can make sure that performance is not addressed negatively.

assignUser avatar Apr 04 '21 17:04 assignUser

I would want it to default to copying to a new data.table, but I am not totally convinced that anyone would ever want to turn it off (and allow changes to the data.table being passed as an argument, which just seems like weird behavior).

kgoldfeld avatar Apr 04 '21 19:04 kgoldfeld

I'm with you on the default as this is (mostly) the simstudy default anyway.

Well the in-place changes are the default behavior for data.table for performance reasons. So maybe for very large datasets this could be helpful? It would at least allow an option for heavy dt users to keep the default behavior of dt they are used to?

assignUser avatar Apr 05 '21 15:04 assignUser

I think an option would be fine for those folks.

kgoldfeld avatar Apr 05 '21 15:04 kgoldfeld