simstudy
simstudy copied to clipboard
Use R-expression for formula arguments
I guess a more R-way (which is a matter of taste) would be to specify formula = -2+age*0.1, i.e., as an R expression, see ?deparse and ?substitute. The same with varname and dist. See base R functions transform() and subset() for inspiration.
As brought up by gagolews in openjournals/joss-reviews/issues/2763 we should think about this long term. It is a breaking change so not a small issue.
Good luck! :+1:
@kgoldfeld would changing the order of the definition table columns cause problems in you workflow or do you think for other reasons it is a bad idea? I would like to change from:
varname formula variance dist link
to
varname dist formula param1 param2
this does not have technical reasons, I just think this order is more intuitive with the renamed variance/link.
I think that would would work though I was also thinking of possibly this:
varname dist param1 param2 link
where param1
replaces formula
and param2
replaces variance
. link
is not really a parameter of a distribution, but the relationship that defines the right and left hand sides of the equation. The obvious flaw is with the the new trtAssign distribution, link
doesn't make so much sense. But, I think it is more confusing to have logit be a value for a parameter. I coud live with the link
being identity (and not really needed to be specified) or nonbalanced for the *trtAssign
dist`.
We could also consider the R
specification for the distribution and link (e.g. in glm
) to be combined, so that we eliminate the link
field. So, dist = binary(logit)
or dist = binary()
- in the case where we don't change the name of binary to binomial. Or dist = trtAssign()
or dist = trtAssign(balanced)
vs. dist = trtAssign(nonbalanced)
. If no link is provided, glm
also allows you to specify without parentheses, so dist = trtAssign
.
Hm. Why not go all out then and do param1 param2 param3
to make it distribution agnostic?
The problem with what you write in your second paragraph is backwards compatibility, which would be hard to uphold I think. At least with the way I have it planned currently... I'll think about it
I think I prefer sticking with link
instead of param3
, for the reasons stated earlier. I guess the confusing thing is that one can think of a parameter as a statistical parameter (ad I do) or as an argument to a function (as I think you might are). Maybe we can come up with a more generic term, but less generic than arg1
, arg2
, and arg3
. Maybe thing1
, thing2
, and thing3
:).
I see where you are coming from... and again it shows that this is true:
There are only two hard things in Computer Science: cache invalidation and naming things. -- Phil Karlton
arg brg crg
xD
As a reference here my current poc https://gist.github.com/assignUser/af104a19c2b6819043bd7e290117e5cd Generation is not finished yet but defining stuff works for now. At the moment I feel that it is a bit clunky with the generics and having to construct an object that is used just once... Maybe I can think of something smoother.
We want to keep the definition tables though right? Otherwise having different params for each dist wouldn't really be a problem, e.g. with a list instead of a dt.
The table was literally the underlying motivation for the whole package, so that might be too much for me to give up.
I though so and didn't want to get rid of the idea of documenting the data generation process, just using another form factor. But I am not sure what that would look like...
How about adding another field - distArg
or dist_arg
or arg
or arg1
, so we would have
varname dist param1 param2 link arg1
So, balanced would be a value for arg1
in the trtAssign dist. And this would add a little flexibility in allowing for alternative specifications for some distributions - say the gamma distribution, which now uses a mean/dispersion specification, but the traditional way is using rate/scale. We could use arg1
to allow users to indicate what version they are using (in the case of the gamma dist, though this is also possible for the beta distribution as well). Also, in the categorical distribution, the alternative categories now specified in variance
could be moved to arg1
. I understand that all of this has backwards mapping compatibility issues. But just a thought.
I published a first sketch of how this could work with using S3 methods here: https://github.com/kgoldfeld/simstudy/blob/issue-75-no-more-strings/R/no_strings.R I currently think that it is pretty cumbersome internally and it might restrict us in our options. I have some other Ideas I am going to test soon.