
Clarify `multilevel` argument

Open · bwiernik opened this issue on Aug 10, 2022 · 5 comments

I find the `multilevel` argument name confusing, and several users have recently opened issues expressing similar confusion.

Based on the name, I would expect a decomposition of the correlation matrix into between-groups and within-groups components, similar to psych::statsBy(). The between-groups component is the correlations among the group means; the within-groups component is the pooled within-group correlation matrix (computed as the correlations among group-mean-centered variables). In my experience (at least in psychology circles), this is what is typically meant by phrases like "multilevel factor analysis", "multilevel SEM", or "multilevel correlations".
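
For concreteness, this is roughly the decomposition psych::statsBy() reports; a minimal sketch, assuming a data frame `dat` with a grouping column `id` (both names hypothetical):

```r
# Sketch of the between/within decomposition via psych::statsBy();
# `dat` and the grouping column `id` are hypothetical.
library(psych)

sb <- statsBy(dat, group = "id")
sb$rbg  # between-groups component: correlations among the group means
sb$rwg  # within-groups component: pooled within-group correlations
```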

The `multilevel` argument computes what is effectively the within-groups component described above, but estimated using random effects (random intercepts for group) rather than fixed effects (group-mean-centering or including the groups as dummy-coded variables). Both fixed and random specifications of this adjustment are "multilevel" in the sense that they estimate average within-group correlations, but we currently do not report the between-groups component of the multilevel correlations under either specification.

I think it would be clearer for the argument to be named something like `random_factors`. This would make it more obvious to me that what the argument switches is how factors are partialed out.

Estimating correct point estimates/df/p/CIs for both within-group and between-group correlations is easy for fixed factor controls (known analytic solutions).

For random-factor controls, we can get reasonable point estimates/df/p/CIs for the within-group correlation using our current estimation approach and some choice of profile likelihood or df approximation, or, I'd argue, we can get close enough by just using the fixed-effects df. For between-group correlations, we can either (1) pivot to a long format, fit a model with `0 + name + (0 + name | id)`, get the correlation from there, and use profile likelihood for the CI, or (2) use our current estimation approach, estimate random effects for persons, and then compute the correlations among those post hoc, using the fixed-effects df. The second option is probably close enough.
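
A rough sketch of option (1) with lme4, for two variables `x` and `y` and a grouping column `id` in a data frame `dat` (all names hypothetical; the CI extraction details would need checking):

```r
# Sketch of option (1): between-group correlation from a bivariate mixed model.
# `dat`, `x`, `y`, and `id` are hypothetical names.
library(lme4)
library(tidyr)

long <- pivot_longer(dat, cols = c(x, y), names_to = "name", values_to = "value")

# One fixed mean per variable, plus correlated random intercepts per group;
# the correlation between the two random intercepts is the between-group correlation.
m <- lmer(value ~ 0 + name + (0 + name | id), data = long)
VarCorr(m)  # the "Corr" entry for the id term is the between-group correlation

# Profile-likelihood CI; the correlation appears among the variance-covariance
# parameters (oldNames = FALSE gives more readable parameter names).
confint(m, method = "profile", oldNames = FALSE)
```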

bwiernik commented on Aug 10, 2022

> I think it would be clearer for the argument to be named something like `random_factors`. This would make it more obvious to me that what the argument switches is how factors are partialed out.

Yeah, that sounds good to me. Though we should do a soft deprecation first, with a warning, and leave it for some time (as this is probably quite a popular feature of the package).

Interestingly, I had the same confusion about multilevel factor/SEM analysis. For me, and in my field, "multilevel" is used as a synonym for mixed models (random-factor models). One day I wanted to have random effects in my SEM and FA, so I looked for it and was thrilled when I saw multilevel FAs... followed by disappointment when I understood it was "just" a stratified analysis. So I can understand how users coming from the opposite direction would have the same confusion...

So yeah, making things more explicit is good. We should probably think about overhauling the whole factor treatment: we could have multiple arguments like `factors_ignored = NULL` (by default, filled with all the factors of the data frame), `factors_fixed` (the previous `include_factors`), `factors_random` (for factors to be treated as random), and then some argument to also include random slopes (https://github.com/easystats/datawizard/issues/203).
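
Purely as an illustration of that argument layout (a sketch of a hypothetical signature, not anything that exists in correlation today):

```r
# Hypothetical argument layout, sketched as a stub; nothing here is implemented.
correlation_factors_sketch <- function(data,
                                       factors_ignored = NULL, # default: all factor columns of `data`
                                       factors_fixed = NULL,   # partialed out as fixed effects (old `include_factors`)
                                       factors_random = NULL,  # partialed out as random intercepts
                                       random_slopes = NULL) { # variables whose slopes should also vary by the random factors
  stop("sketch only - not implemented")
}
```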

DominiqueMakowski commented on Aug 11, 2022

I'm thinking of shifting to an explicit declaration of which variables should be partialed or semipartialed, which would make a lot of the arguments easier to manage together.

bwiernik commented on Aug 11, 2022

Is this still relevant? As you mentioned, I also believe the current implementation may be a bit confusing for people working with multilevel data who expect a decomposition into within- and between-group variance, similar to what psych::statsBy() achieves. To address this, I have been working on some code that achieves this decomposition (as I don't like the output of psych::statsBy() that much and didn't understand how it arrived at its estimates) and was wondering if this would be helpful.

My script basically centres variables within and between clusters (similar to bmlm::isolate()) and calculates both the within- and between-group correlations. Confidence intervals and p-values can be calculated and adjusted using Fisher's z-transformation. My implementation gives results that are pretty much identical to psych::statsBy(), and it recovers the within- and between-group correlations as specified in a simulation I wrote. Do you think this would fit in somehow with the `multilevel` argument? Or perhaps this could be an additional feature?
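
For reference, a minimal sketch of the idea (not my actual script), assuming a data frame `dat` with numeric variables `x` and `y` and a grouping column `id` (all names hypothetical):

```r
# Within/between decomposition by centering, plus a Fisher-z CI for the
# within-group correlation. `dat`, `x`, `y`, and `id` are hypothetical names.
library(dplyr)

decomp <- dat |>
  group_by(id) |>
  mutate(across(c(x, y),
                list(within = ~ .x - mean(.x),   # group-mean-centered part
                     between = ~ mean(.x)))) |>  # group-mean part
  ungroup()

# Pooled within-group correlation (correlation of the centered variables)
r_within <- cor(decomp$x_within, decomp$y_within)

# Between-group correlation (correlation of the group means, one row per group)
group_means <- distinct(decomp, id, x_between, y_between)
r_between <- cor(group_means$x_between, group_means$y_between)

# Fisher's z CI for the within-group correlation; using n minus the number of
# groups as a rough effective sample size is a deliberate simplification here.
n_eff <- nrow(dat) - n_distinct(dat$id)
ci_within <- tanh(atanh(r_within) + c(-1, 1) * qnorm(0.975) / sqrt(n_eff - 3))
```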

Pascal-Kueng commented on Apr 21, 2023

I could easily see this as a feature. I think it would be nice as either a different correlation method or a separate function because, if I understand correctly, it has a different output: it returns two correlation indices (between and within), is that correct?

> I'm thinking of shifting to an explicit declaration of which variables should be partialed or semipartialed,

I agree with that; moving forward, we would probably need to rethink how to make the API more explicit, more flexible, and less confusing.

DominiqueMakowski commented on Apr 21, 2023

Yes, that's exactly right. For example, it could return one correlation matrix (and a table with other statistics) for the within-group correlations and a second, separate correlation matrix for the between-group correlations. I also think a nice implementation of the `summary()` method would be a correlation table with the within-group correlations above the diagonal and the between-group correlations below the diagonal. I think I could provide a class and some functions that would achieve all this. As the structure is different from that of all other correlations, perhaps a standalone class would be better?
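
As an illustration of that `summary()` layout, a small sketch of how a combined display could be built from two same-sized correlation matrices (the function name and layout are just a suggestion):

```r
# Combine two correlation matrices into a single display matrix:
# within-group correlations above the diagonal, between-group correlations below.
# `r_within` and `r_between` are assumed to share dimensions and dimnames.
combine_within_between <- function(r_within, r_between) {
  out <- diag(1, nrow(r_within))
  dimnames(out) <- dimnames(r_within)
  out[upper.tri(out)] <- r_within[upper.tri(r_within)]
  out[lower.tri(out)] <- r_between[lower.tri(r_between)]
  out
}
```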

Pascal-Kueng commented on Apr 21, 2023