simstudy `genCorGen()` simulating data with all lower correlation than actual data (than expected?)

Hi,

Thanks for the great package!

I'm attempting to simulate some binary data and noticing that the simulated data correlation matrix coefficient values appear to consistently come out less than expected.

I have real binary data. From this I calculated the correlation matrix and use that along with the real probabilities of the binary data to generate the new dataset.

I am unclear if this behaviour is to be expected or not. Since these original correlation coefficients were generated from "real" data I don't understand why it wouldnt be possible to replicate the same correlation matrix in the simulated data. Perhaps I am using the function incorrectly or there is some statistical property that I am not fully grasping. Any clarification, help/suggestion would be greatly appreciated.

Below is a reprex to illustrate:

library(dplyr)
library(ggplot2)
library(tidyr)

# some "real/actual" binary data
df <- tibble::tribble(
  ~V1, ~V2, ~V3, ~V4, ~V5, ~V6, ~V7, ~V8,
  0, 0, 0, 0, 0, 0, 0, 0,
  1, 0, 1, 0, 1, 0, 1, 0,
  1, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 1, 0, 1, 0, 0, 0,
  0, 1, 1, 1, 0, 1, 1, 1,
  0, 1, 1, 1, 1, 1, 1, 1,
  0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0,
  0, 1, 0, 1, 0, 1, 0, 1,
  1, 0, 0, 1, 0, 1, 0, 1,
  1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 0, 1, 0, 1, 0,
  1, 0, 0, 0, 0, 0, 0, 0,
  1, 1, 1, 1, 1, 1, 1, 1,
  0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0,
  0, 1, 1, 1, 1, 1, 1, 1,
  0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0,
  1, 0, 0, 0, 0, 0, 0, 0,
  1, 0, 1, 0, 1, 0, 1, 0,
  0, 0, 0, 0, 0, 0, 0, 0,
  1, 0, 1, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0,
  1, 1, 1, 1, 1, 1, 1, 1,
  0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0,
  1, 1, 0, 1, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0,
  1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1,
  0, 1, 0, 0, 0, 0, 0, 0,
  0, 1, 0, 0, 0, 0, 0, 0,
  1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 0, 1, 1, 1, 0, 1,
  0, 0, 0, 0, 0, 0, 0, 0,
  0, 1, 0, 1, 0, 1, 0, 0,
  0, 0, 0, 0, 0, 1, 0, 0
)

# calculate probabilities (i.e mean of each col)
df_probs <- df |>
  summarise(
    across(everything(), mean)
  ) |>
  pivot_longer(
    cols = everything(),
    values_to = "prob"
  )

# create a correlation matrix
actual_cor_matrix <- cor(df)

# also played with this, but didn't solve the issue
# actual_cor_matrix <- psych::tetrachoric(data.matrix(df),smooth = TRUE)$rho


sim_binary <- simstudy::genCorGen(
  n = 100000,
  nvars = nrow(df_probs),
  params1 = df_probs$prob,
  corMatrix = actual_cor_matrix,
  cnames = colnames(actual_cor_matrix),
  dist = "binary",
  method = "copula",
  wide = TRUE
)
sim_cor_matrix <- cor(sim_binary[, -1]) # remove id column

You can see that all simulated data coefficient value are less than actual data correlation coefficient values by subtracting the correlation matrices and seeing all negative values (except 0)

sim_cor_matrix - actual_cor_matrix
#>            V1         V2         V3         V4         V5         V6         V7
#> V1  0.0000000 -0.1053751 -0.1719094 -0.1384342 -0.1863495 -0.1052367 -0.1751274
#> V2 -0.1053751  0.0000000 -0.1696604 -0.2201598 -0.1841508 -0.2183775 -0.2060719
#> V3 -0.1719094 -0.1696604  0.0000000 -0.1649232 -0.2100745 -0.1649136 -0.2029969
#> V4 -0.1384342 -0.2201598 -0.1649232  0.0000000 -0.1772885 -0.2005001 -0.1992614
#> V5 -0.1863495 -0.1841508 -0.2100745 -0.1772885  0.0000000 -0.1805726 -0.2141742
#> V6 -0.1052367 -0.2183775 -0.1649136 -0.2005001 -0.1805726  0.0000000 -0.1992070
#> V7 -0.1751274 -0.2060719 -0.2029969 -0.1992614 -0.2141742 -0.1992070  0.0000000
#> V8 -0.1407307 -0.2254219 -0.1982286 -0.2082238 -0.2073274 -0.2076596 -0.2201022
#>            V8
#> V1 -0.1407307
#> V2 -0.2254219
#> V3 -0.1982286
#> V4 -0.2082238
#> V5 -0.2073274
#> V6 -0.2076596
#> V7 -0.2201022
#> V8  0.0000000

Another visual comparison

df_cor_actual_long <- actual_cor_matrix |>
  data.frame() |>
  tibble::rownames_to_column(var = "framework") |>
  pivot_longer(-framework, names_to = "framework_compare", values_to = "correlation") |>
  mutate(
    type = "actual"
  )

df_cor_sim_long <- sim_cor_matrix |>
  data.frame() |>
  tibble::rownames_to_column(var = "framework") |>
  pivot_longer(
    -framework,
    names_to = "framework_compare",
    values_to = "correlation"
  ) |>
  mutate(
    type = "simulated"
  )

df_correlations_compare <- bind_rows(
  df_cor_actual_long,
  df_cor_sim_long
)



df_correlations_compare |>
  pivot_wider(
    names_from = type,
    values_from = correlation
  ) |>
  ggplot(aes(x = actual, y = simulated)) +
  geom_point(alpha = 0.3) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  ) +
  labs(
    x = "ORIGINAL: correlation matrix coefficients",
    y = "SIMULATED: correlation matrix coefficients"
  )

^{Created on 2024-09-05 with reprex v2.1.0}

Sep 05 '24 18:09 zackarno

That is just the nature of the copula algorithm. The transformation from continuous data (where the data are generated) to the binary outcome loses a lot of information, so correlation necessarily decreases. The same is true for Poisson data, though the loss is not as substantial. However, in the case of binary data, there is another algorithm: if you specify method = "ep", you should get much better results. Let me know if that improves things.

Sep 05 '24 18:09 kgoldfeld

Wow the method = "ep" does make the correlation approximations much closer to those of the original data correlations. I had seen the "ep" option in the documentation, but discounted it due to an "range" error I was getting in my actual analysis code (not reprex). However, after revisiting I've got it working and got passed the error message by a little bit of pre-processing removing some rows with NA values that had gotten accidentally introduced (so i think that was the issue).

Thanks so much for your answer. So if I understand correctly, you are saying the copula algorithm looses information because it initially generates/simulates continuous data and then thresholds it to binary? Assuming I got that right, this makes alot of sense to me now and reminds me of what I had read about it prior to finding your package. I guess I should read up a bit more on how the "ep" algorithm works, but for now I am glad that it seems more fit for my purpose.

FYI here is the same plot as my previous post, but this time generated using the "ep" method.

^{Created on 2024-09-05 with reprex v2.1.0}

Sep 05 '24 20:09 zackarno

That is an improvement - and yes, your intuition is correct.

I had a post about the method a while back, with a link to the Emrich & Piedmonte paper.

Sep 05 '24 20:09 kgoldfeld