Correlated data from multiple different distributions

Open tombisho opened this issue 4 years ago • 4 comments

Thank you for this excellent package.

I have a dataset which consists of 5 continuous variables and 5 categorical variables. I can generate a correlation matrix for this data set*, along with means/SDs of the continuous and counts for the categorical variables.

At the moment it looks like I can build 2 different simstudy datasets, one using the correlations between the continuous variables, their means and SDs, and another using the same technique for the categorical variables. However, I don't see how I can make use of the correlations between the continuous and categorical variables to generate a complete dataset that

It may be that I am not using simstudy correctly, in which case I would appreciate any advice on how I can do what I have described above.

*forgive my stats naivety if this is not a valid thing to do

tombisho avatar Jul 30 '21 12:07 tombisho

Thanks for your note - it would be helpful if you shared the code that you are currently using.

kgoldfeld avatar Aug 02 '21 14:08 kgoldfeld

Yes you are right, sorry for not following the guidance! Here is something that might help illustrate:

library("simstudy")

# Keep only the continuous columns of mtcars
cont_data <- mtcars[, !(names(mtcars) %in% c("cyl", "vs", "am", "gear", "carb"))]
cols <- colnames(cont_data)
corrs <- cor(cont_data)
means <- colMeans(cont_data)
sds <- apply(cont_data, 2, sd)

# Simulate 40 rows of correlated normal data matching those means, SDs, and correlations
dd <- genCorData(n = 40, mu = means, sigma = sds, corMatrix = corrs, cnames = cols)

I can use simstudy to build a dataset that captures the properties and relationships between the continuous variables. But now I am stuck as to how I would apply this to the categorical and binary columns. It feels like I need to specify everything in one go to capture the relationships between all the variables, but I don't know how I can do this with mixed distribution types.

Any thoughts would be greatly appreciated

tombisho avatar Aug 02 '21 15:08 tombisho

simstudy can accommodate generating correlated data from different distributions using the function genCorFlex (see here). However, the distributions are currently limited to "binary", "poisson", "gamma", "normal", and "uniform" distributions. There is currently also functionality to generate correlated ordinal (categorical) data using genOrdCat, but this has not been integrated with other types of distributions.
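
As a rough illustration of the genCorFlex approach, here is a minimal sketch that draws one normal, one binary, and one Poisson variable with a shared compound-symmetry correlation. The variable names and parameter values are assumptions chosen to loosely resemble mtcars columns; they are not taken from this thread. Note that rho is applied on an underlying normal scale, so the observed correlations between the generated variables will generally differ from it.

```r
library(simstudy)

set.seed(2021)

# Hypothetical marginal definitions (names and values are illustrative only)
def <- defData(varname = "mpg", formula = 20, variance = 36, dist = "normal")
def <- defData(def, varname = "am", formula = 0.4, dist = "binary")
def <- defData(def, varname = "carb", formula = 2.8, dist = "poisson")

# One correlated draw across all three distributions;
# corstr = "cs" gives every pair the same underlying correlation rho
dd <- genCorFlex(n = 500, defs = def, rho = 0.4, corstr = "cs")

# Inspect the correlations actually realised in the simulated data
round(cor(as.matrix(dd[, c("mpg", "am", "carb")])), 2)
```

A full correlation matrix can be passed through the corMatrix argument instead of rho and corstr if the pairwise correlations are not all equal.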

kgoldfeld avatar Aug 02 '21 15:08 kgoldfeld

OK that is great, thank you - I missed genCorFlex in the vignettes. Are there plans to add ordinal data to genCorFlex? I guess in the meantime one could convert the ordinal variables to binaries?
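
One way the ordinal-to-binary conversion could look, sketched in base R under the assumption that cumulative "at or above threshold" indicators are an acceptable stand-in for the ordinal variable. The choice of mtcars$gear and the names gear_bin and probs are illustrative only.

```r
# Ordinal variable with ordered levels 3 < 4 < 5 (mtcars ships with base R)
gear <- mtcars$gear

# Every level except the lowest acts as a threshold
cuts <- sort(unique(gear))[-1]

# One cumulative indicator per threshold: 1 if gear >= threshold, else 0
gear_bin <- sapply(cuts, function(k) as.integer(gear >= k))
colnames(gear_bin) <- paste0("gear_ge", cuts)

# Observed proportions, usable as probabilities for "binary" definitions
probs <- colMeans(gear_bin)
probs
```

Binaries simulated from these proportions are not guaranteed to respect the nesting of the indicators (gear_ge5 = 1 should imply gear_ge4 = 1), so this is an approximation rather than a true ordinal generator.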

tombisho avatar Aug 02 '21 19:08 tombisho