Improved d values for paired designs
Currently, the Cohen's d (etc.) values we compute for paired designs are standardized difference scores (dz). Those are rarely what folks think about when they say "Cohen's d" and can be misleadingly large. http://jakewestfall.org/blog/index.php/2016/03/25/five-different-cohens-d-statistics-for-within-subject-designs/
This limitation seems to produce confusion: https://twitter.com/utekleong/status/1432796123320578051?s=21
We should update our arguments here to include other paired d value options.
Yes, have been thinking about this too...
- [x] d_z: Given by `paired = TRUE`
- [x] d_t: Essentially offered by `t_to_d()` (even have the option not to bias by `sqrt(2)`).
- [x] d: Isn't this just `paired = FALSE`?
With replications (Should only allow formula input)
- [x] d_a is the same as "just d" after collapsing across replications. Should we offer to collapse for users?
- [x] d_r Need to add.
The issue is that the confidence intervals will be different. https://psycnet.apa.org/record/1996-04469-005
We could also consider adding an interface for pretest-post test-control group d values. https://journals.sagepub.com/doi/abs/10.1177/1094428106291059
As part of that, it might be useful to provide a more flexible interface for specifying the SD to use?
(Regarding dz, I liked your characterization of it as the signal-to-noise ratio. Perhaps we should encourage that label to reduce ambiguity?)
Always back to the CIs.... 😂
I think, on the development side, we need to first build a new internal function (not exported) that:
- preps the data (can be a separate internal function)
- Estimates these within/rm estimates.
- Gets the correct CIs for the estimates.
After that is up and working, we can think about how to get that working from the user-facing cohens_d.
(That is, we will have two internal functions doing the heavy lifting - one for the classic between d and within dz, and the new one)
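Something like this, perhaps - a minimal sketch of that split (all names here are hypothetical, the estimator is shown for d_z only, and the CI step is still missing):

```r
# Hypothetical internal helpers, just to illustrate the division of labor
# described above (not the actual branch code):
.prep_rm_data <- function(data, dv, condition, id) {
  # prep: collapse replications into one mean per subject per condition ("wide")
  tapply(data[[dv]], list(data[[id]], data[[condition]]), mean)
}

.estimate_dz <- function(wide) {
  # estimate: mean of the difference scores / SD of the difference scores
  diffs <- wide[, 1] - wide[, 2]
  mean(diffs) / stats::sd(diffs)
}

# a third helper (.ci_rm_d(), say) would then attach the appropriate CI
```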
> (Regarding dz, I liked your characterization of it as the signal-to-noise ratio. Perhaps we should encourage that label to reduce ambiguity?)
Yes, we could definitely emphasize this in the docs (: (I do somewhat allude to it here)
I've started working on the new function on the paired_d branch. Currently only gives the estimated d (no CIs).
Replicates the results from Jake's post (marked beside each function call here):
```r
dat <- read.table("http://pcl.missouri.edu/exp/effectSizePuzzler.txt", header=TRUE)
paired_d(rt ~ cond | id, data = dat, type = "d") # 0.2497971
#> d
#> ----
#> 0.25
paired_d(rt ~ cond | id, data = dat, type = "a") # 0.8357347
#> d_a
#> ----
#> 0.84
paired_d(rt ~ cond | id, data = dat, type = "z") # 1.353713
#> d_z
#> ----
#> 1.35
paired_d(rt ~ cond | id, data = dat, type = "t") # withOUT sqrt(2) bias, is the same as d_z
#> d_t
#> ----
#> 1.35
paired_d(rt ~ cond | id, data = dat, type = "r") # 0.259
#> boundary (singular) fit: see ?isSingular
#> d_r
#> ----
#> 0.26
```
Created on 2021-09-05 by the reprex package (v2.0.1)
I don't think we should include "t". It's just wrong
At a broader level, rather than a type argument, would it be better to have a less opaque combination of arguments that produce the various types? Something like in `cohens_d()`: `paired`, `aggregate`, and `SD` (or maybe `scale` or `standardizer`)?
So if I want type "d", I do `paired=T, aggregate="none", SD="pooled"`.
For type "a", I do `paired=T, aggregate="mean", SD="pooled"`.
For type "z", I do `paired=T, aggregate="difference", SD="pooled"`.
For type "r", `paired=T, aggregate="none", SD="rmse"` (or maybe residual or sigma).
There are two things I like about this. One, it makes it more transparent what choices the user is making. Two, we can use the same API to expand to other options besides these five.
For example, we could expand the SD argument to flexibly do Glass's d in the same function by using `SD="control"` (with the value being the factor level if using the formula interface), or we can flexibly handle a pretest-posttest-control group d (difference in difference scores, standardized by pooled pretest SDs).
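For concreteness, calls under this scheme might look something like this (purely hypothetical - none of these arguments exist in `cohens_d()` today):

```r
# Hypothetical argument combinations mapping to the types above:
cohens_d(rt ~ cond | id, data = dat, paired = TRUE, aggregate = "none",       SD = "pooled")  # "d"
cohens_d(rt ~ cond | id, data = dat, paired = TRUE, aggregate = "mean",       SD = "pooled")  # "a"
cohens_d(rt ~ cond | id, data = dat, paired = TRUE, aggregate = "difference", SD = "pooled")  # "z"
cohens_d(rt ~ cond | id, data = dat, paired = TRUE, aggregate = "none",       SD = "rmse")    # "r"
```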
I think this can be simplified by having just SD?
| Paired  | SD              | Out             |
|---------|-----------------|-----------------|
| `TRUE`  | `"raw"`         | d               |
|         | `"residuals"`   | dr              |
|         | `"means"`       | da              |
|         | `"differences"` | dz              |
| `FALSE` | `"pooled"`      | d with pooled   |
|         | `"unpooled"`    | d with unpooled |
|         | `"control"`     | Glass' delta    |
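For example (again, hypothetical calls only - these options don't exist yet):

```r
cohens_d(rt ~ cond | id, data = dat, paired = TRUE, SD = "differences")  # dz
cohens_d(rt ~ cond | id, data = dat, paired = TRUE, SD = "residuals")    # dr
cohens_d(extra ~ group, data = sleep, paired = FALSE, SD = "control")    # Glass' delta
```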
Anyway, let's first get the CIs up and running, and then we can debate the interface / argument names (:
@bwiernik Do we have any leads for CIs for d, dr and da?
I can resort to using emmeans::eff_size() but who wants that??
There are also Cohen's d(av) and d(rm) from https://doi.org/10.3389/fpsyg.2013.00863 which give different results than the 4 mentioned above...
For the original d types regular, a, r and z:
```r
library(effectsize)
dat <- read.table("http://pcl.missouri.edu/exp/effectSizePuzzler.txt", header=TRUE)
paired_d(rt ~ cond | id, data = dat, type = "z") # 1.353713
#> Cohen's d | 95% CI
#> ------------------------
#> 1.35 | [0.82, 1.93]
paired_d(rt ~ cond | id, data = dat, type = "d") # 0.2497971
#> d | 95% CI
#> -------------------
#> 0.25 | [0.18, 0.32]
paired_d(rt ~ cond | id, data = dat, type = "a") # 0.8357347
#> d_a | 95% CI
#> -------------------
#> 0.84 | [0.60, 1.13]
paired_d(rt ~ cond | id, data = dat, type = "r") # 0.259
#> d_r | 95% CI
#> -------------------
#> 0.26 | [0.19, 0.33]
```
For types regular, a, and r, CIs are bootstrapped: all are based on a merMod model, so using `bootMer()`. Seems to work out nicely (:
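Roughly the idea (a stripped-down sketch of the approach, not the actual branch code):

```r
# Fit a mixed model, define the effect size as a function of the fitted merMod,
# and take percentile CIs from lme4::bootMer(). Shown here for the residual-SD
# ("r") flavor; the other types just swap the standardizer.
library(lme4)

fit <- lmer(rt ~ cond + (cond | id), data = dat)  # may warn about a singular fit

d_resid <- function(m) unname(fixef(m)[2]) / sigma(m)

set.seed(1)
boots <- bootMer(fit, d_resid, nsim = 500)
d_resid(fit)                           # point estimate
quantile(boots$t[, 1], c(.025, .975))  # percentile 95% CI
```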
Need to take a look at those additional types av and rm from https://doi.org/10.3389/fpsyg.2013.00863.
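(For reference, those two are straightforward to compute from a pair of aggregated vectors; a quick sketch of the formulas as I read them in Lakens 2013, without CIs and not the eventual implementation:)

```r
# d_av: mean difference over the average of the two SDs
# d_rm: mean difference over the SD of the differences, rescaled by sqrt(2(1 - r))
d_av_rm <- function(x, y) {
  diffs <- x - y
  r <- cor(x, y)
  s1 <- sd(x)
  s2 <- sd(y)
  c(d_av = mean(diffs) / ((s1 + s2) / 2),
    d_rm = mean(diffs) / sqrt(s1^2 + s2^2 - 2 * r * s1 * s2) * sqrt(2 * (1 - r)))
}
```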
I don't know, are 7 methods enough?
```r
library(effectsize)
dat <- read.table("http://pcl.missouri.edu/exp/effectSizePuzzler.txt", header=TRUE)
# From http://jakewestfall.org/blog/index.php/2016/03/25/five-different-cohens-d-statistics-for-within-subject-designs/
paired_d(rt ~ cond | id, data = dat, type = "d") # 0.2497971
#> d | 95% CI
#> -------------------
#> 0.25 | [0.17, 0.33]
paired_d(rt ~ cond | id, data = dat, type = "a") # 0.8357347
#> d_a | 95% CI
#> -------------------
#> 0.84 | [0.26, 1.44]
paired_d(rt ~ cond | id, data = dat, type = "z") # 1.353713
#> d_z | 95% CI
#> -------------------
#> 1.35 | [0.82, 1.93]
paired_d(rt ~ cond | id, data = dat, type = "t") # 1.914439
#> d_t | 95% CI
#> -------------------
#> 1.91 | [1.15, 2.73]
paired_d(rt ~ cond | id, data = dat, type = "r") # 0.259
#> d_r | 95% CI
#> -------------------
#> 0.26 | [0.18, 0.34]
# More types:
# https://doi.org/10.3389/fpsyg.2013.00863
# https://journals.sagepub.com/doi/pdf/10.1177/0013164403256358 (CIs)
paired_d(rt ~ cond | id, data = dat, type = "rm")
#> d_rm | 95% CI
#> -------------------
#> 0.82 | [0.76, 0.89]
paired_d(rt ~ cond | id, data = dat, type = "av")
#> d_av | 95% CI
#> -------------------
#> 0.84 | [0.77, 0.90]
```
Created on 2022-01-28 by the reprex package (v2.0.1)
@bwiernik This is essentially done. Still...
- [ ] Not sure what the best interface / name for this function is
- [ ] If you have the time to go over the code to see that everything (especially the CIs) makes sense
:)
It seems like method a and av give very similar if not identical results - I think they actually are the same. Can someone check me on that?
(The CI method of "av" gives more reasonable results, so if they are the same, I will use that one)
Change of plan:
- Use the paired ds listed here (4 types): https://matthewbjane.quarto.pub/effect-size-and-confidence-intervals-guide/
- These use normal approximate CIs.
- function name: `paired_d()`
  - Remove paired option from `cohens_d()`
- input options:
  - formula 1: `Pair(x1, x2) ~ 1`
  - formula 2: `x ~ group | id` (aggregate if needed)
  - 3 arguments: x, group, id (aggregate if needed)
  - 1 argument: `x = Pair()`
- Have small bias correction option?
Hey @MatthewBJane, would love your input on some things here.
Just to bring you up to speed:
Combining from Jake Westfall's blog post, Lakens' paper, and your guide, we get the following standardized mean differences for repeated measures designs:
- For aggregated data (one observation per subject per cell):
- $d_z$
- $d_{av}$ (in Westfall's post this is $d_a$)
- $d_b$
- $d_{rm}$
- For replication data (more than one observation per subject per cell):
- Cohen's $d$ - just standardize by pooled sd (for aggregated data, this is the same as $d_{av}$)
- $d_r$ - standardize by residual sd (after accounting for random intercepts and slopes)
- For the first 4, we can get normal CIs thanks to the SEs in your guide - for the latter two (the replication effect sizes $d$ and $d_r$) do you know of a method for obtaining SEs?
- Should all 6 methods be accessible from the same function? What would you call it / them? `paired_d()` holds for 1-4, but isn't exactly accurate for the replication effect sizes.
- Methods for user input - do these make sense? Can we simplify somehow?
  - For paired (aggregated) data:
    - Two paired vectors `x`, `y` ("wide")
    - A formula `Pair(x, y) ~ 1` (a la `t.test()`)
  - For replication data ("long" data that can be aggregated!):
    - A formula: `x ~ condition | id`
- Which method should be the default?
Thanks!
Hi, thanks for including me in this conversation.
- I kind of don't understand the point of $d_a$? I guess one utilization would be to correct for measurement error, as with more replicates you would have a larger $d_a$, due to the fact that each subject would be approaching their true scores. Aside from that, I feel like it's just scoring the data for the user and then calculating a regular $d_{av}$? For $d_r$, I do know of a blog post/R function (`lmeInfo::g_mlm`) by James Pustejovsky that discusses the standardized mean difference and its standard error from a mixed effects model, see here.
- I think the standardized mean difference for replicates should have its own function; I can imagine you could do a good number of things with that function. You could even use lavaan to calculate the standardized mean difference on pre/post latent variables (and maybe even assess measurement invariance). Then I think the first 4 SMDs could all be one function: $d_z$, $d_{rm}$, $d_b$, $d_{av}$.
- I like the formula method for both paired and replicate data
- I think for paired data the default should probably be $d_{rm}$. I personally prefer $d_b$ for a few reasons, but I think $d_{rm}$ is more common.
I also want to add to the pre/post/control/treatment comparisons that Brenton had mentioned. I personally think the best strategy is using $d_b$ (standardizing by baseline/pre-test; i.e., Becker's d), but this strategy works for any of the paired effect sizes:
$$d_C = \frac{M^{\text{post}}_C-M^{\text{pre}}_C}{S^{\text{pre}}_C}$$
$$d_T = \frac{M^{\text{post}}_T-M^{\text{pre}}_T}{S^{\text{pre}}_T}$$
$$d_{ATE} = d_T-d_C$$
Then the standard error could be computed easily from the standard errors of $d_C$ and $d_T$:
$$SE_{ATE} = \sqrt{SE_T^2 + SE_C^2}$$
You could also reduce the sampling error variance by pooling the pre-test standard deviations instead ($S^{\text{pre}}_P$).
$$d_{ATE} =\frac{(M^{\text{post}}_T-M^{\text{pre}}_T) - (M^{\text{post}}_C-M^{\text{pre}}_C)}{S^{\text{pre}}_P}$$
However, I am not sure how to calculate the standard error for this. I recall a paper somewhere that had the standard error for this, but I cannot seem to find it. Now, you can do the same strategy with any of the first 6 effect sizes. I would suggest that the function spit out the treatment group SMD ($d_T$), the control group SMD ($d_C$), and the standardized average treatment effect ($d_{ATE}$).
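In code, that combination step is just (a small sketch, assuming $d_T$, $d_C$, and their standard errors have already been computed per group; the function name is made up):

```r
# Standardized average treatment effect from two group-wise standardized
# mean changes and their standard errors:
d_ate <- function(d_T, se_T, d_C, se_C) {
  c(d_ATE = d_T - d_C,
    SE_ATE = sqrt(se_T^2 + se_C^2))
}

d_ate(d_T = 0.60, se_T = 0.15, d_C = 0.10, se_C = 0.12)
#>     d_ATE    SE_ATE 
#> 0.5000000 0.1920937
```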
Thanks @MatthewBJane!
In my experience, repeated measures designs more often than not include replications (either multiple measurements in each condition or multiple items on a questionnaire), so I might be wrong, but most "paired" analyses would thus be aggregated-and-then-paired analyses. 🤷 So it doesn't at all seem weird to me that the effect size will increase with additional repetitions... but maybe that's just me?
1. d and SE for $d$ and $d_r$
I was able to implement the method presented in {lmeInfo} with stats::aov():
```r
data(stroop, package = "afex")
stroop1 <- subset(stroop, study == 1 & condition == "control" & acc == 1)
mod <- aov(rt ~ congruency + Error(pno / congruency), data = stroop1,
contrasts = list(congruency = contr.treatment))
MLM <- nlme::lme(fixed = rt ~ congruency, random = ~ congruency | pno,
contrasts = list(congruency = contr.treatment),
                 data = stroop1, control = nlme::lmeControl(opt = "optim"))
```
Just $d$
```r
m <- coef(mod[["pno:congruency"]])
m_V <- vcov(mod[["pno:congruency"]])
pars <- parameters::model_parameters(mod)
e <- as.data.frame(pars[pars$Parameter == "Residuals",])
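# pooled error SD: sqrt of the summed residual SS over the summed residual df
# (pooling across all error strata of the aov fit)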
s <- sqrt(sum(e[["Sum_Squares"]]) / sum(e[["df"]]))
df <- sum(e[["df"]])
d <- m/s
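# approximate sampling SE of d, and the Hedges-style small-sample correction factor J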
se <- sqrt((df / (df - 2)) * (m_V / (s^2)) + (d ^ 2) * (8 * df ^2 - df + 2) / (16*(df-2)*((df-1)^2)))
J <- (1 - 3 / (4 * df - 1))
c(d = unname(d * J), sd_d = se * J)
#> d sd_d
#> 0.58609285 0.02532089
# Compare to:
lmeInfo::g_mlm(MLM, p_const = c(0, 1), r_const = c(1,1,1,1))[c("g_AB", "SE_g_AB")]
#> $g_AB
#> [1] 0.5464201
#>
#> $SE_g_AB
#> [1] 0.0253104
```
$d_{r}$
```r
m <- coef(mod[["pno:congruency"]])
m_V <- vcov(mod[["pno:congruency"]])
pars <- parameters::model_parameters(mod)
e <- as.data.frame(pars[pars$Group == "Within",])
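# here the standardizer is the residual SD from the lowest (Within) error stratum
# only, rather than pooling all error strata as for plain d above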
s <- sqrt(sum(e[["Mean_Square"]]))
df <- e[["df"]]
d <- m/s
se <- sqrt((df / (df - 2)) * (m_V / (s^2)) + (d ^ 2) * (8 * df ^2 - df + 2) / (16*(df-2)*((df-1)^2)))
J <- (1 - 3 / (4 * df - 1))
c(d = unname(d * J), sd_d = se * J)
#> d sd_d
#> 0.70567780 0.03048858
# Compare to:
lmeInfo::g_mlm(MLM, p_const = c(0, 1), r_const = c(0,0,0,1))[c("g_AB", "SE_g_AB")]
#> $g_AB
#> [1] 0.7035682
#>
#> $SE_g_AB
#> [1] 0.0303898
```
I'd say - pretty close!
2+3+4. Functions
I just wrote a whole thing, but then realized that "just d" can also be applied to paired data:
```r
data(sleep)
mod <- aov(extra ~ group + Error(ID / group), data = sleep,
contrasts = list(group = contr.treatment))
MLM <- nlme::lme(fixed = extra ~ group, random = ~ 1 | ID, # can't include group in random!
contrasts = list(group = contr.treatment),
data = sleep, control = nlme::lmeControl(opt = "optim"))
m <- coef(mod[["ID:group"]])
m_V <- vcov(mod[["ID:group"]])
pars <- parameters::model_parameters(mod)
e <- as.data.frame(pars[pars$Parameter == "Residuals",])
s <- sqrt(sum(e[["Sum_Squares"]]) / sum(e[["df"]]))
df <- sum(e[["df"]])
d <- m/s
se <- sqrt((df / (df - 2)) * (m_V / (s^2)) + (d ^ 2) * (8 * df ^2 - df + 2) / (16*(df-2)*((df-1)^2)))
J <- (1 - 3 / (4 * df - 1))
c(d = unname(d * J), sd_d = se * J)
#> d sd_d
#> 0.7970185 0.2557877
# Compare to:
lmeInfo::g_mlm(MLM, p_const = c(0, 1), r_const = c(1,1))[c("g_AB", "SE_g_AB")]
#> $g_AB
#> [1] 0.7745582
#>
#> $SE_g_AB
#> [1] 0.2803914
```
(This gives a different result than av, rm, or z.)
And so I don't see a justification to split this into two functions. We can have a single function paired_d (personal favorite) or rm_d or dependent_d (?):
- input: `x`, `y` paired vectors, or a `Pair(x, y) ~ 1` formula for wide data, or an `x ~ condition | id` formula for long data (possibly with replications)
- types:
  - rm (default), av, b, z (require paired data, so replications will be aggregated)
  - d
  - r (requires replications)
Even more??
> I think the standardized mean difference for replicates should have its own function, I can imagine you could do a good number of things with that function.
I think all those things are already covered by `lmeInfo::g_mlm()` and/or `emmeans::eff_size()`, and we can reference those functions. WDYT?
Even more yet??
As for the pre-post/control-treatment effect size - I think that deserves its own issue (:
@bwiernik any thoughts?
Okay, here's what I have so far - I think this is looking great!
```r
# Paired data
data(sleep)
paired_d(Pair(extra[group == 1], extra[group == 2]) ~ 1, data = sleep, type = "rm")
#> d_rm | 95% CI
#> ----------------------
#> -0.75 | [-1.17, -0.33]
paired_d(Pair(extra[group == 1], extra[group == 2]) ~ 1, data = sleep, type = "av")
#> d_av | 95% CI
#> ---------------------
#> -0.76 | [-1.60, 0.07]
paired_d(Pair(extra[group == 1], extra[group == 2]) ~ 1, data = sleep, type = "z")
#> d_z | 95% CI
#> ----------------------
#> -1.17 | [-1.94, -0.41]
paired_d(Pair(extra[group == 1], extra[group == 2]) ~ 1, data = sleep, type = "b")
#> d_b | 95% CI
#> ----------------------
#> -0.81 | [-1.31, -0.30]
paired_d(Pair(extra[group == 1], extra[group == 2]) ~ 1, data = sleep, type = "d")
#> d_d | 95% CI
#> ----------------------
#> -0.80 | [-1.30, -0.30]
dat <- read.table("effectSizePuzzler.txt", header = TRUE)
paired_d(rt ~ cond | id, data = dat, type = "rm")
#> d_rm | 95% CI
#> ----------------------
#> -0.80 | [-1.06, -0.53]
paired_d(rt ~ cond | id, data = dat, type = "av", adjust = FALSE) # jakewestfall.org/blog: 0.84
#> d_av | 95% CI
#> ----------------------
#> -0.84 | [-1.41, -0.26]
paired_d(rt ~ cond | id, data = dat, type = "z", adjust = FALSE) # jakewestfall.org/blog: 1.35
#> d_z | 95% CI
#> ----------------------
#> -1.35 | [-1.90, -0.81]
paired_d(rt ~ cond | id, data = dat, type = "b")
#> d_b | 95% CI
#> ----------------------
#> -0.86 | [-1.19, -0.53]
paired_d(rt ~ cond | id, data = dat, type = "d") # jakewestfall.org/blog: 0.25
#> d_d | 95% CI
#> ----------------------
#> -0.25 | [-0.32, -0.18]
paired_d(rt ~ cond | id, data = dat, type = "r") # jakewestfall.org/blog: 0.26
#> d_r | 95% CI
#> ----------------------
#> -0.26 | [-0.33, -0.18]
```
@mattansb This looks excellent! Do you think it would be better if the types had more meaningful labels rather than the subscripts? For example, for the Becker's d formulation I used $d_b$ in the guide, but I don't think that is conventional. Maybe instead of `type` it could be `standardizer`?
- "rm" -> "within.subject"
- "av" -> "average.variance"
- "z" -> "difference.score"
- "b" -> "pre.test"
- "r" -> "lme.residual"
It definitely makes sense that $d_{a}$ would increase with more measurements - wouldn't this just be due to the reduction in measurement error variance, thus reducing the variance in scores and increasing $d_a$? I have two questions/thoughts on this:
- Wouldn't multiple measurements also occur in unpaired designs? Should there be functionality for aggregating multiple measurements for unpaired designs as well?
- I would worry people may use $d_a$ without thinking and start averaging measurements that are on different scales. There could be a few ways to calculate scores (raw average, principal component/factor scores, z-score averaging) which may produce different $d$ values. Maybe different ways of scoring the data could be added, for example `averaging = "factor.scores"` or `averaging = "z-scores"`.
I have changed the name of the function to repeated_measures_d() (with an alias rm_d()). I think keeping the subscripts as the argument input is easier for the user to switch between them. I have the docs looking like this:
(I still need to polish those out...)
As for the argument name - maybe scaler, denominator (or den)?
I'm not sure why you're singling out $d_a$ - aggregation is required for av, z, rm, and b. In all of those cases (paired), more replications will reduce measurement error and increase the measured effect size, bringing it closer to the "true" (latent) effect size.
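A quick simulated illustration of that point (toy numbers, just to show the direction of the effect):

```r
# With more replicates per subject, the per-cell means are less noisy, so the
# aggregated (paired) effect size grows toward the error-free effect size.
set.seed(42)
sim_d_av <- function(k) {  # k = replicates per subject per condition
  n <- 50
  subj <- rnorm(n)                                  # random subject intercepts
  x <- replicate(k, subj + 0.0 + rnorm(n, sd = 2))  # condition A
  y <- replicate(k, subj + 0.5 + rnorm(n, sd = 2))  # condition B (true diff = 0.5)
  mx <- rowMeans(x); my <- rowMeans(y)
  (mean(my) - mean(mx)) / ((sd(mx) + sd(my)) / 2)
}
sapply(c(1, 5, 50), sim_d_av)  # d_av increases with the number of replicates
```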
Perhaps if there's aggregation we can give a message (this is what afex::aov_ez() does)?
```
The {type} standardized difference requires paired data, but data contains more than one observation per design cell.
Aggregating data using 'mean'.
```
> I would worry people may use $d_a$ without thinking
I'm already assuming everyone is doing this anyway 😅 - but we can add this information in the documentation / vignette.
> [...] and they start averaging measurements that are on different scales.
This doesn't sound like it would be very common - why would someone have, in long format, a single column with differently scaled values? Assuming this is the data that would otherwise be fed into lmer() or the like...
> Maybe different ways of scoring the data could be added, for example: [...]
Can you elaborate on this? How can you average for each condition with z scores? This would just give 0s to everyone... (Not sure what "factor scores" would be here either.)
Closing in favor of #618.
The `type` argument has been named `method` to be consistent with other such uses in effectsize.