simstudy icon indicating copy to clipboard operation
simstudy copied to clipboard

Use R-expression for formula arguments

Open assignUser opened this issue 3 years ago • 11 comments

I guess a more R-way (which is a matter of taste) would be to specify formula = -2+age*0.1, i.e., as an R expression, see ?deparse and ?substitute. The same with varname and dist. See base R functions transform() and subset() for inspiration.

As brought up by gagolews in openjournals/joss-reviews/issues/2763 we should think about this long term. It is a breaking change so not a small issue.

assignUser avatar Oct 21 '20 09:10 assignUser

Good luck! :+1:

gagolews avatar Oct 21 '20 10:10 gagolews

@kgoldfeld would changing the order of the definition table columns cause problems in you workflow or do you think for other reasons it is a bad idea? I would like to change from: varname formula variance dist link to varname dist formula param1 param2 this does not have technical reasons, I just think this order is more intuitive with the renamed variance/link.

assignUser avatar Nov 05 '21 09:11 assignUser

I think that would would work though I was also thinking of possibly this:

varname dist param1 param2 link

where param1 replaces formula and param2 replaces variance. link is not really a parameter of a distribution, but the relationship that defines the right and left hand sides of the equation. The obvious flaw is with the the new trtAssign distribution, link doesn't make so much sense. But, I think it is more confusing to have logit be a value for a parameter. I coud live with the link being identity (and not really needed to be specified) or nonbalanced for the *trtAssign dist`.

We could also consider the R specification for the distribution and link (e.g. in glm) to be combined, so that we eliminate the link field. So, dist = binary(logit) or dist = binary() - in the case where we don't change the name of binary to binomial. Or dist = trtAssign() or dist = trtAssign(balanced) vs. dist = trtAssign(nonbalanced). If no link is provided, glm also allows you to specify without parentheses, so dist = trtAssign.

kgoldfeld avatar Nov 05 '21 12:11 kgoldfeld

Hm. Why not go all out then and do param1 param2 param3 to make it distribution agnostic? The problem with what you write in your second paragraph is backwards compatibility, which would be hard to uphold I think. At least with the way I have it planned currently... I'll think about it

assignUser avatar Nov 05 '21 13:11 assignUser

I think I prefer sticking with link instead of param3, for the reasons stated earlier. I guess the confusing thing is that one can think of a parameter as a statistical parameter (ad I do) or as an argument to a function (as I think you might are). Maybe we can come up with a more generic term, but less generic than arg1, arg2, and arg3. Maybe thing1, thing2, and thing3 :).

kgoldfeld avatar Nov 05 '21 13:11 kgoldfeld

I see where you are coming from... and again it shows that this is true:

There are only two hard things in Computer Science: cache invalidation and naming things. -- Phil Karlton

arg brg crg xD

assignUser avatar Nov 05 '21 13:11 assignUser

As a reference here my current poc https://gist.github.com/assignUser/af104a19c2b6819043bd7e290117e5cd Generation is not finished yet but defining stuff works for now. At the moment I feel that it is a bit clunky with the generics and having to construct an object that is used just once... Maybe I can think of something smoother.

We want to keep the definition tables though right? Otherwise having different params for each dist wouldn't really be a problem, e.g. with a list instead of a dt.

assignUser avatar Nov 05 '21 13:11 assignUser

The table was literally the underlying motivation for the whole package, so that might be too much for me to give up.

kgoldfeld avatar Nov 05 '21 14:11 kgoldfeld

I though so and didn't want to get rid of the idea of documenting the data generation process, just using another form factor. But I am not sure what that would look like...

assignUser avatar Nov 05 '21 18:11 assignUser

How about adding another field - distArg or dist_arg or arg or arg1, so we would have

varname dist param1 param2 link arg1

So, balanced would be a value for arg1 in the trtAssign dist. And this would add a little flexibility in allowing for alternative specifications for some distributions - say the gamma distribution, which now uses a mean/dispersion specification, but the traditional way is using rate/scale. We could use arg1 to allow users to indicate what version they are using (in the case of the gamma dist, though this is also possible for the beta distribution as well). Also, in the categorical distribution, the alternative categories now specified in variance could be moved to arg1. I understand that all of this has backwards mapping compatibility issues. But just a thought.

kgoldfeld avatar Nov 08 '21 15:11 kgoldfeld

I published a first sketch of how this could work with using S3 methods here: https://github.com/kgoldfeld/simstudy/blob/issue-75-no-more-strings/R/no_strings.R I currently think that it is pretty cumbersome internally and it might restrict us in our options. I have some other Ideas I am going to test soon.

assignUser avatar Feb 16 '22 12:02 assignUser