Permutation test number of tails

Open jCalderTravis opened this issue 2 years ago • 25 comments

Description of the problem

The p-value produced by the function permutation_t_test does not depend on whether a one- or two-tailed test is requested. This is not the expected behaviour.

Steps to reproduce

import numpy as np
import mne.stats as stats

testData = np.atleast_2d(np.arange(5)).T

for thisTail in [0, 1]:
    _, p_values, _ = stats.permutation_t_test(testData, tail=thisTail)
    print('p-value: {}'.format(p_values[0]))

Link to data

No response

Expected results

The p-value for a one-tailed test should be about half the size of the p-value for a two-tailed test.

Actual results

The p-values for the one-tailed and two-tailed tests are the same.

Permuting 255 times (exact test)...
p-value: 0.0078125
Permuting 511 times (exact test)...
p-value: 0.0078125

Additional information

MNE version: 0.22.0
Platform: Windows-10
Python: 3.8.6
numpy: 1.19.5 {blas=NO_ATLAS_INFO, lapack=lapack}
scipy: 1.6.0
matplotlib: 3.3.3 {backend=Qt5Agg}
sklearn: 0.24.0
numba: 0.52.0
nibabel: 3.2.1
nilearn: 0.7.0
dipy: 1.3.0
cupy: Not found
pandas: 1.2.0
mayavi: 4.7.2
pyvista: 0.27.4 {pyvistaqt=0.2.0, OpenGL 4.5.0 - Build 30.0.101.1692 via Intel(R) UHD Graphics 620}
vtk: 9.0.1
PyQt5: 5.12.3

jCalderTravis avatar Feb 28 '23 10:02 jCalderTravis


Would you have the time to open a PR with a test, e.g. using some randn data, comparing the p-values with those obtained from a parametric t-test?

thanks


agramfort avatar Feb 28 '23 12:02 agramfort

I see your MNE version is rather old; does this problem still occur with the latest release?

hoechenberger avatar Feb 28 '23 12:02 hoechenberger

Related: some time ago (years ago, so it might no longer be relevant) I thought one of the permutation cluster test functions in MNE had the same issue. I never dug deeper to confirm because at that time I started using my own code for cluster-based permutation tests, but I can take a look and check it this week or the next.

mmagnuski avatar Feb 28 '23 17:02 mmagnuski

Thank you very much for your replies. I will test it with the latest MNE version and also try to open a PR with a test next week.

jCalderTravis avatar Mar 01 '23 13:03 jCalderTravis

your test data are all non-negative. Try this:

import numpy as np
import mne.stats as stats

test_data = np.atleast_2d(np.arange(500) - 250).T  # <- critical change here

for name, tail in dict(lower=-1, upper=1, both=0).items():
    _, p_values, _ = stats.permutation_t_test(test_data, tail=tail)
    print(f'{name} tail: p={p_values[0]}')

result (omitting the joblib lines):

lower tail: p=0.9409
upper tail: p=1.0
both tail: p=0.9412

drammock avatar Mar 07 '23 16:03 drammock

Thanks for your reply. I don't think this is the expected behaviour. The data in your example are symmetric around zero, with zero mean, so the distribution of t-statistics computed through permutations will also be symmetric around zero. Hence for a two-tailed test the expected p-value is indeed about 1, but for one-tailed tests the expected p-value is about 0.5, because the test statistic (a t-statistic of 0) lies halfway along the permutation distribution of t-values.
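For concreteness, here is a minimal standalone sketch (plain numpy/scipy, not MNE code) of an exhaustive sign-flip permutation test on data that are exactly symmetric around zero; it illustrates the expected one-tailed p-value of about 0.5 and two-tailed p-value of 1:

import itertools
import numpy as np
from scipy.stats import ttest_1samp

x = np.arange(9) - 4.0  # symmetric around zero, so the observed t-statistic is 0
t_obs = ttest_1samp(x, 0).statistic

# enumerate all 2**9 = 512 sign-flip patterns (feasible for such a small sample)
signs = np.array(list(itertools.product([1.0, -1.0], repeat=len(x))))
t_perm = np.array([ttest_1samp(s * x, 0).statistic for s in signs])

print('one-tailed (greater):', np.mean(t_perm >= t_obs))               # ~0.5
print('one-tailed (less):   ', np.mean(t_perm <= t_obs))               # ~0.5
print('two-tailed:          ', np.mean(np.abs(t_perm) >= abs(t_obs)))  # 1.0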

Here is a comparison to a parametric test, using normally distributed data and now using MNE version 1.3.1:

import numpy as np
import mne.stats as stats
import scipy.stats as sciStats

test_data = np.random.normal(size=(10000, 1))

for name, tail in {'less':-1, 'greater': 1, 'two-sided': 0}.items():
    _, p_value, _ = stats.permutation_t_test(test_data, 
                                                tail=tail,
                                                verbose=False)
    print('Permutation {} test: p={}'.format(name, np.round(p_value[0], 4)))

    result = sciStats.ttest_1samp(test_data, 
                                    popmean=0,
                                    axis=None,
                                    alternative=name)
    p_value = result.pvalue
    print('Parametric {} test: p={}'.format(name, np.round(p_value, 4)))

    print('')

gives the result:

Permutation less test: p=0.8361
Parametric less test: p=0.4175

Permutation greater test: p=1.0
Parametric greater test: p=0.5825

Permutation two-sided test: p=0.8279
Parametric two-sided test: p=0.8349

I will try to submit a PR with a test based on this.

jCalderTravis avatar Mar 09 '23 09:03 jCalderTravis

@jCalderTravis Thanks, the p-values should indeed be very close, but the permutation test gives about 2 * parametric_p_value... (I expected it to be the other way around). Maybe the p-values are multiplied by 2 (two-tailed correction) irrespective of the tail argument?

mmagnuski avatar Mar 09 '23 09:03 mmagnuski

@mmagnuski good point. It does seem that the issue is with the one-tailed tests, and these p-values are approximately twice as big as they should be. I am afraid I could not yet understand the code well enough to determine the cause. I will try to make a PR with a test, but I need a bit of time because I haven't contributed before.

jCalderTravis avatar Mar 10 '23 10:03 jCalderTravis

@jCalderTravis I took a look at the code yesterday. The problem is that for one-tailed tests, a one-tailed distribution was created by taking the absolute value of the test statistic during the permutations. This makes the p-values approximately twice as large (the top 5% of the positive tail corresponds to 2.5% of the whole distribution). It can be fixed by removing the absolute value from the _max_stat function and applying the correction only to the two-tailed case (the current implementation effectively applied the two-tailed correction to one-tailed tests as well).

Below is the changed code from MNE (with imports modified to work outside the MNE codebase) and your original example (slightly modified):


from math import sqrt
import numpy as np

from mne.utils import check_random_state, verbose, logger
from mne.parallel import parallel_func


def _max_stat(X, X2, perms, dof_scaling):
    """Aux function for permutation_t_test (for parallel comp)."""
    n_samples = len(X)
    mus = np.dot(perms, X) / float(n_samples)
    stds = np.sqrt(X2[None, :] - mus * mus) * dof_scaling  # std with splitting
    max_abs = np.max(mus / (stds / sqrt(n_samples)), axis=1)  # t-max
    return max_abs


@verbose
def permutation_t_test(X, n_permutations=10000, tail=0, n_jobs=None,
                       seed=None, verbose=None):
    from mne.stats.cluster_level import _get_1samp_orders
    n_samples, n_tests = X.shape
    X2 = np.mean(X ** 2, axis=0)  # precompute moments
    mu0 = np.mean(X, axis=0)
    dof_scaling = sqrt(n_samples / (n_samples - 1.0))
    std0 = np.sqrt(X2 - mu0 ** 2) * dof_scaling  # get std with var splitting
    T_obs = np.mean(X, axis=0) / (std0 / sqrt(n_samples))
    rng = check_random_state(seed)
    orders, _, extra = _get_1samp_orders(n_samples, n_permutations, tail, rng)
    perms = 2 * np.array(orders) - 1  # from 0, 1 -> 1, -1
    logger.info('Permuting %d times%s...' % (len(orders), extra))
    parallel, my_max_stat, n_jobs = parallel_func(_max_stat, n_jobs)
    max_abs = np.concatenate(parallel(my_max_stat(X, X2, p, dof_scaling)
                                      for p in np.array_split(perms, n_jobs)))
    max_abs = np.concatenate((max_abs, [np.abs(T_obs).max()]))
    H0 = np.sort(max_abs)
    if tail == 0:
        # multiplying by 2 is for two-tail correction
        p_values = np.minimum(
            (H0 >= np.abs(T_obs[:, np.newaxis])).mean(-1) * 2, 1.)
    elif tail == 1:
        p_values = (H0 >= T_obs[:, np.newaxis]).mean(-1)
    elif tail == -1:
        p_values = (-H0 <= T_obs[:, np.newaxis]).mean(-1)
    return T_obs, p_values, H0


# %% the original example
from scipy.stats import ttest_1samp

n_permutations = 1_000
test_data = np.random.normal(size=(1000, 1))

for name, tail in {'less':-1, 'greater': 1, 'two-sided': 0}.items():
    _, p_value, _ = permutation_t_test(
        test_data, n_permutations=n_permutations, tail=tail, verbose=False)
    print(f'Permutation {name} test: p={p_value[0]:.4f}')

    result = ttest_1samp(test_data, popmean=0, alternative=name)
    print(f'Parametric {name} test: p={result.pvalue[0]:.4f}', end='\n\n')

This produces better results now:

Permutation less test: p=0.6250
Parametric less test: p=0.6239

Permutation greater test: p=0.3900
Parametric greater test: p=0.3761

Permutation two-sided test: p=0.7380
Parametric two-sided test: p=0.7523

mmagnuski avatar Mar 10 '23 11:03 mmagnuski

another, simpler option would be to just divide by 2 in the one-tail branches to get rid of the two-tail correction. :)
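For concreteness, a rough self-contained sketch (a toy helper, not a tested patch against MNE) of what that simpler option could look like, assuming an H0 built from |t|-max values and a roughly symmetric null; note that the discussion below ends up preferring to keep the sign in H0 instead:

import numpy as np

def p_from_abs_max_h0(H0_abs, T_obs, tail):
    """Toy p-values from an absolute-max-t null distribution.

    H0_abs : 1D array of |t|-max values from the permutations.
    T_obs  : 1D array of observed t-values (one per tested variable).
    tail   : 0 (two-tailed), 1 (upper) or -1 (lower), as in permutation_t_test.
    """
    T = np.asarray(T_obs)[:, np.newaxis]
    if tail == 0:
        # H0_abs already pools both tails, so no further correction is needed
        return (H0_abs >= np.abs(T)).mean(-1)
    elif tail == 1:
        # halve: only ~half of the pooled |t|-max null belongs to the upper tail
        return (H0_abs >= T).mean(-1) / 2.0
    else:  # tail == -1
        return (-H0_abs <= T).mean(-1) / 2.0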

mmagnuski avatar Mar 10 '23 11:03 mmagnuski

can you open a PR for this?


agramfort avatar Mar 12 '23 15:03 agramfort

Sure, I'll do that today.

mmagnuski avatar Mar 12 '23 16:03 mmagnuski

It might be slightly trickier than that. I think currently we only do half the permutations, because in a two-tailed test the second half would produce the same extreme cluster values as the first half. In a one-tailed test that won't necessarily be the case. So I think in the one-tailed case we actually need to modify the test to do all permutations (and make sure that we're correctly thresholding just the positive or negative half, not both).
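A small standalone check of that symmetry argument (toy data and the plain t-max statistic, not the MNE cluster internals): a sign-flip pattern and its negation give identical |t|-max values, so half the flips suffice for the two-tailed statistic, but the signed t-max generally differs between the two:

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(8, 3))          # 8 observations of 3 hypothetical variables
s = rng.choice([-1.0, 1.0], size=8)  # one sign-flip pattern

def t_per_column(signs, X):
    Xs = signs[:, None] * X
    return Xs.mean(0) / (Xs.std(0, ddof=1) / np.sqrt(len(Xs)))

print(np.abs(t_per_column(s, X)).max(), np.abs(t_per_column(-s, X)).max())  # identical
print(t_per_column(s, X).max(), t_per_column(-s, X).max())                  # generally differ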

larsoner avatar Mar 12 '23 16:03 larsoner

Tagging this for the intermediate dev sprint. To whoever takes this on: a test has been added with an XFAIL annotation in #11575, so the task now is to remove that annotation (so the test fails noisily instead of silently) and then fix the implementation so that the test passes.

drammock avatar Jul 05 '23 15:07 drammock

I will work on this during the sprint

kimcoco avatar Nov 13 '23 10:11 kimcoco

The following version should collect an H0 of t-max values per permutation with their original sign (i.e. negative/positive), and do the correct thresholding on the negative, positive, or both ends; see the comment lines where the code was changed in _max_stat and permutation_t_test.

from math import sqrt
import numpy as np

from mne.utils import check_random_state, verbose, logger
from mne.parallel import parallel_func

def _max_stat(X, X2, perms, dof_scaling):
    """Aux function for permutation_t_test (for parallel comp)."""
    n_samples = len(X)
    mus = np.dot(perms, X) / float(n_samples)
    stds = np.sqrt(X2[None, :] - mus * mus) * dof_scaling  # std with splitting
    # calculating and picking the absolute maximum t-values with their original +/- sign
    tvals = mus / (stds / sqrt(n_samples))
    max_abs = np.squeeze(np.take_along_axis(tvals, np.argmax(np.abs(tvals), axis=1,
                                                             keepdims=True), axis=1))
    return max_abs

@verbose
def permutation_t_test(X, n_permutations=10000, tail=0, n_jobs=None,
                       seed=None, verbose=None):
    from mne.stats.cluster_level import _get_1samp_orders
    n_samples, n_tests = X.shape
    X2 = np.mean(X ** 2, axis=0)  # precompute moments
    mu0 = np.mean(X, axis=0)
    dof_scaling = sqrt(n_samples / (n_samples - 1.0))
    std0 = np.sqrt(X2 - mu0 ** 2) * dof_scaling  # get std with var splitting
    T_obs = np.mean(X, axis=0) / (std0 / sqrt(n_samples))
    rng = check_random_state(seed)
    orders, _, extra = _get_1samp_orders(n_samples, n_permutations, tail, rng)
    perms = 2 * np.array(orders) - 1  # from 0, 1 -> 1, -1
    logger.info('Permuting %d times%s...' % (len(orders), extra))
    parallel, my_max_stat, n_jobs = parallel_func(_max_stat, n_jobs)
    max_abs = np.concatenate(parallel(my_max_stat(X, X2, p, dof_scaling)
                                      for p in np.array_split(perms, n_jobs)))
    # concatenate also the signed T_obs absmax
    max_abs = np.concatenate((max_abs,
                              np.atleast_1d(T_obs.flat[np.abs(T_obs).argmax()])))
    H0 = np.sort(max_abs)
    # testing T_obs against the actual neg, pos, or both tails
    if tail == 0:
        p_values = (np.abs(H0) >= np.abs(T_obs[:, np.newaxis])).mean(-1)
    elif tail == 1:
        p_values = (H0 >= T_obs[:, np.newaxis]).mean(-1)
    elif tail == -1:
        p_values = (H0 <= T_obs[:, np.newaxis]).mean(-1)
    return T_obs, p_values, H0

This produces the correct output on the test used in the discussion above:

# %% the original example
from scipy.stats import ttest_1samp

n_permutations = 1_000
test_data = np.random.normal(size=(1000, 1))

for name, tail in {'less':-1, 'greater': 1, 'two-sided': 0}.items():
    _, p_value, _ = permutation_t_test(
        test_data, n_permutations=n_permutations, tail=tail, verbose=False)
    print(f'Permutation {name} test: p={p_value[0]:.4f}')

    result = ttest_1samp(test_data, popmean=0, alternative=name)
    print(f'Parametric {name} test: p={result.pvalue[0]:.4f}', end='\n\n')

as well as on a variety of usage scenarios with simulated data arrays like those in test_permutations.py, and passes the tests there, too, with the new tail test added in https://github.com/mne-tools/mne-python/pull/11575/files.

The correct treatment of the negative and positive tails of real and permuted data seems especially important when we don't have a perfectly normal distribution, which is exactly the situation permutation tests are intended for. This is also how permutation and cluster statistics are handled in FieldTrip (see the screenshot attached to the original comment).

As for the number of permutations (half or full, as mentioned by @larsoner), I will look into that next to see how they are handled currently and whether anything needs to be adjusted.

I will do a draft PR now so you can check the code already and see if I make sense or am mistaken somehow...

(..you will see that while old and new data and tail tests pass with the sign-handling changes I did, the 'equivalent result to spatiotemporal cluster test' test fails now.. )

kimcoco avatar Nov 16 '23 01:11 kimcoco

I did some digging through the FieldTrip code of https://github.com/fieldtrip/fieldtrip/blob/master/ft_statistics_montecarlo.m and what they are doing is indeed what @kimcoco reported:

  1. compute the probabilities for the negative and positive tails separately
  2. report the smaller of the two p-values

In case of clustering on top, both the negative and positive clusters are reported, but that is beyond what we are discussing right now.
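For illustration, a minimal numpy sketch (a toy function, not a transcription of the FieldTrip code) of those two steps, given a signed max-statistic null distribution H0 and observed values T_obs:

import numpy as np

def min_tail_p(H0, T_obs):
    """Per-tail probabilities from a signed null distribution, then the smaller one."""
    T = np.asarray(T_obs)[:, np.newaxis]
    p_pos = (H0 >= T).mean(-1)       # step 1: probability in the positive tail
    p_neg = (H0 <= T).mean(-1)       #         and in the negative tail
    return np.minimum(p_pos, p_neg)  # step 2: report the smaller of the two

Whether that minimum is then doubled (or the alpha level halved) to account for testing both tails is a separate convention on top of this.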

I also looked at @kimcoco's snippet and that seems alright to me. The difference between the two solutions indeed comes from the fact that the positive and negative tails of H0 are not identical, so you might slightly over- or underestimate your p-value if you disregard the negative tail and simply double the p-value obtained from the positive one. Now I guess the question is whether that is always just sampling error (i.e. we could disregard it) or whether it can reflect true biases in the data. I tend towards the latter, also just thinking of the cases where one might run a full permutation.

britta-wstnr avatar Nov 16 '23 15:11 britta-wstnr

I also tend toward the latter and think we should run a full permutation when the tail is not zero / two-tailed

larsoner avatar Nov 16 '23 16:11 larsoner

@larsoner do you mean "full permutation" as in "not estimate the negative tail from the positive one" or as in "let's run an exhaustive permutation"? I think the latter can take forever in many cases...

britta-wstnr avatar Nov 20 '23 09:11 britta-wstnr

Yeah I mean don't estimate the negative tail from the positive one

larsoner avatar Nov 20 '23 17:11 larsoner

Then I'm with you @larsoner! 😄

britta-wstnr avatar Nov 20 '23 17:11 britta-wstnr

Hi all, I also had a look into the code (before seeing your discussion here).

(..you will see that while old and new data and tail tests pass with the sign-handling changes I did, the 'equivalent result to spatiotemporal cluster test' test fails now.. )

I think this test actually fails for code-related reasons. Currently, tail=-1 is not passed to permutation_cluster_1samp_test when calculating the comparison values (but it should be, as the comparison is against the one-tailed t-test above). It makes sense that it didn't break earlier, when tails were disregarded in permutation_t_test :)

When adding tail=-1, the two functions actually produce the same results/p-values. What makes the assertion fail nevertheless is keep = p_values < 1: the p-values are < 1 for all 5 t-values from the t-test now (while the cluster-level test kicks out 2 of them). Unfortunately, I'm not deep enough into the matter yet; I think either permutation_cluster_1samp_test kicks out two clusters that shouldn't be kicked out in the one-tailed case, or the keep rule should be updated.

dominikwelke avatar Nov 25 '23 19:11 dominikwelke

Hi, thank you for looking in, @dominikwelke! I already know this (i.e. why this equivalence test fails) and made these changes in my branch a week ago. The reason this has not moved on is that a) I was too ill to work/post the whole past week, and b) the cluster permutation test still has to be thoroughly checked for its behavior, which I have started on, so that I can describe it in an issue and propose a fix (if needed) that makes both functions behave similarly in terms of H0 collection and one-/two-tailed testing.

The state here is that the cluster permutation test (including all versions, as they all call _permutation_cluster_test in the background) is not as "wrong" as permutation_t_test in handling the signed tails, i.e. it checks one-tailed hypotheses against the correct tail and does not "double" their p-value.

Where it differs is that in the one-tailed version it only collects the H0 for one tail. I have been checking different use cases (e.g. a highly significant positive sample together with a marginally significant negative one, or the flipped case) and the resulting p-values with all tail-test versions. Here, in cases where both positive and negative clusters are present but you test one-sided, the p-value can show a tiny difference from the one tested against the same tail of an H0 that was collected for both tails. (There is also a sign error in the case where the negative T-max of the data gets appended to the H0 with the wrong sign, but that's easy to fix.)

So, this relates back to the question of the 'full permutation'/number of permutations. While for two-tailed testing, as discussed above, one needs to collect an H0 with both tails and not estimate one from the other, it should conceptually be OK to collect only the positive or negative H0 tail for a one-tailed test, provided the picking of the signed T-max values is handled correctly. However, I think the tiny p-value differences between the two-tailed-H0/one-tailed test (as in the current permutation_t_test) and the one-tailed-H0/one-tailed test (as in the current _permutation_cluster_test) in certain cases come from the number of collected T-max values in the opposite tail, since p-values are calculated as the percentage of H0 values lower/higher than the data's T-max. With the same number of overall permutations, this makes the one-tailed test on a one-tailed H0 a bit 'stricter', i.e. it produces a slightly higher p-value than on a two-tailed H0, especially if there are many t-values with opposite signs, which can be the case with real data.
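To make that concrete, here is a toy comparison (plain numpy, not the MNE internals) of the two H0-collection schemes, using fake per-permutation t-values for several variables: an upper-tail-only H0 (the maximum t per permutation) versus a two-tailed H0 (the signed t at the location of the maximum |t| per permutation):

import numpy as np

rng = np.random.default_rng(3)
t_perm = rng.standard_normal((10000, 20))        # fake permutation t-values
h0_upper = t_perm.max(axis=1)                    # one-tailed collection
idx = np.abs(t_perm).argmax(axis=1)
h0_signed = t_perm[np.arange(len(t_perm)), idx]  # signed abs-max collection

t_obs = 2.0                                      # hypothetical observed t-value
print('p against one-tailed H0:', np.mean(h0_upper >= t_obs))   # slightly larger
print('p against two-tailed H0:', np.mean(h0_signed >= t_obs))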

Given that we usually use these functions with a fixed n_permutations of 1000 or 10000 (as full permutations cannot be calculated), while the data and the resulting t-values and T-max collections (H0s) might not be fully symmetrical around 0, it seems more of a theoretical choice whether the p-value of a 'less' test should represent the probability of my negative t-value against that of all negative clusters resulting from permutation, or against any significant clusters resulting from permutation.

While I would also tend to say that, in the case of clusters, staying strictly in one tail (as in the current implementation of _permutation_cluster_test) makes sense, it definitely also makes sense to collect and test against a two-tailed H0 when running a permutation_t_test without clustering, as in this issue.

So: what level of equivalence between these two functions do we want, and what should be tested in test_permutations.py?

This would be my remaining question for this issue/PR... but I would also like to first finish checking _permutation_cluster_test and post an issue about the wrong-signed T-max appending, and then these changes, if approved, should probably go in together. Or should I just include it in this branch/PR, if the change is tiny?

kimcoco avatar Nov 26 '23 11:11 kimcoco

Hi @kimcoco

sorry that it took me a while to react to your last message. Great to see you are making progress on this! And please don't worry about not posting when you are ill.

I thought about this for a while now, and the easiest way for me personally to conceptualize it is the following (I am writing this out so we can spot any misunderstandings, or places where I have understood things incorrectly): For a one-sided test, we collect our permutations. Then we look at the permutations that are greater than our observation (if we assume a "greater than" H1). That proportion gives us the p-value, i.e. we look at only the positive tail of the distribution under H0. For a two-sided test, we collect our permutations. Then we look at the absolute values of the permutations and see how many are greater than the absolute value of our observation to get our p-value, i.e. we look at both tails of the H0 distribution. Now we have to adjust the alpha level by dividing it by two to account for the two tails (e.g., 0.025 instead of 0.05).

With the same number of overall permutations, this makes the one-tailed test on a one-tailed H0 a bit 'stricter', i.e. it produces a slightly higher p-value than on a two-tailed H0, especially if there are many t-values with opposite signs, which can be the case with real data.

So even with perfectly symmetric tails, the p-value of the two-sided test should not be the same as that of a one-sided test, no? I know the p-values can be adapted, so if that is being done, maybe the difference comes from the conversion?

From my point of view, we want to be as precise as possible with collecting tails and computing p-values. Meaning, we do not want to take one tail times two, and we do not want to mix positive and negative observations in the output. We can collect absolute values of both tails to determine p-values, though (for two-tailed testing); I think that is fine as it should not result in a different p-value.

Regarding your other questions:

This would be my remaining question for this issue/PR... but I would also like to first finish checking _permutation_cluster_test and post an issue about the wrong-signed T-max appending, and then these changes, if approved, should probably go in together. Or should I just include it in this branch/PR, if the change is tiny?

To me it makes sense to discuss all of this in one issue, to keep track of how everything interacts and relates, and also to have it in one PR. In my view this is a good exception to "open smaller PRs as they are easier to review", because I would have to cross-reference the other PR in my head all the time anyway, I assume!

britta-wstnr avatar Dec 05 '23 11:12 britta-wstnr