mne-python icon indicating copy to clipboard operation
mne-python copied to clipboard

new cluster test API

Open CarinaFo opened this issue 1 year ago • 16 comments

set up new cluster_test api that sets up design matrix based on Wilkinson formula (using formulaic package)

  • currently unable to continue working on this project due to moving continent (soon based in Australia), estimated back to normal by end of November

CarinaFo avatar Jun 14 '24 12:06 CarinaFo

This might be related:

https://github.com/mne-tools/mne-incubator/issues/31

cc @SophieHerbst

hoechenberger avatar Jun 15 '24 15:06 hoechenberger

I think it is a great idea to revamp the current cluster permutation test API 🚀! Could you please share the reasoning behind choosing a pandas DataFrame as the container for evokeds and related meta info? If possible, I'd try to avoid using pandas (an optional dependency) when something similar could be achieved using, for example, a simple dictionary.

cbrnr avatar Jun 19 '24 13:06 cbrnr

@cbrnr I think the main reason for a dataframe instead of a dictionary is that formulaic only allows for dataframes as input and we want to include Wilkinson formula support in the new cluster_test API.

CarinaFo avatar Jun 19 '24 16:06 CarinaFo

@cbrnr I think the main reason for a dataframe instead of a dictionary is that formulaic only allows for dataframes as input and we want to include Wilkinson formula support in the new cluster_test API.

But this doesn't need to be user-facing, then, no? Just trying to understand. If a user passes in a list of TypedDicts or Dataclasses, you can internally create the DataFrames that need to be passed to formulaic. The user would then also get tab-completion assistance in their editor. But it's just a thought. Great work so far in any case!

hoechenberger avatar Jun 19 '24 16:06 hoechenberger

you can internally create the DataFrames that need to be passed to formulaic

This would still require pandas to be available though.

drammock avatar Jun 19 '24 16:06 drammock

you can internally create the DataFrames that need to be passed to formulaic

This would still require pandas to be available though.

Sure But it would provide a potentially more user-friendly API

hoechenberger avatar Jun 19 '24 16:06 hoechenberger

you can internally create the DataFrames that need to be passed to formulaic

This would still require pandas to be available though.

Sure But it would provide a potentially more user-friendly API

For context, this is step 1 of a GSoC project. A later step involves (probably) creating helper functions that will create the necessary DataFrame for the user. That's not done here because:

  1. there are probably complicated use cases / designs that we won't foresee, so we want it to be possible for the user to pass in their own custom dataframe instead of our internally-generated one.
  2. we're actually not sure how helpful the helper function will be (at least in some cases): if what you need to pass in is a set of matched lists of subject IDs, evoked objects, and condition names, well then you might as well just call DataFrame(dict(subj=subj_list, cond=cond_list, data=evk_list)) yourself instead of calling the helper function. A helper function makes much more sense in other cases, like when dealing with Epochs objects where the conditions are intermixed within one object.

drammock avatar Jun 19 '24 19:06 drammock

@CarinaFo I pushed a commit to add the dataset and add formulaic to our full and doc dependencies so that eventually CIs can use them properly. I added the tags [skip azp] [skip actions] to skip running those CIs. Feel free to git pull the changes back to your local machine!

If you look at the CI runs, you can see that CircleCI, which builds modified examples/tutorials in PRs, hit an error up on 9c8ec90:

sphinx.errors.ExtensionError: Could not find docstring in file "/home/circleci/project/tutorials/stats-sensor-space/76_new_cluster_test_api.py". A docstring is required by sphinx-gallery unless the file is ignored by "ignore_pattern"

Then I pushed a little commit to fix that in 47363b5, and now it hits a different error (which replicates what I saw locally when I tried to run the example):

../tutorials/stats-sensor-space/76_new_cluster_test_api.py unexpectedly failed to execute correctly:

    Traceback (most recent call last):
      File "/home/circleci/project/tutorials/stats-sensor-space/76_new_cluster_test_api.py", line 449, in <module>
        df_long = convert_wide_to_long(df)
      File "/home/circleci/project/tutorials/stats-sensor-space/76_new_cluster_test_api.py", line 431, in convert_wide_to_long
        data_2d = row["data"]
      File "/home/circleci/python_env/lib/python3.10/site-packages/pandas/core/series.py", line 1121, in __getitem__
        return self._get_value(key)
      File "/home/circleci/python_env/lib/python3.10/site-packages/pandas/core/series.py", line 1237, in _get_value
        loc = self.index.get_loc(label)
      File "/home/circleci/python_env/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
        raise KeyError(key) from err
    KeyError: 'data'

It would be good to keep CircleCI green at least so we can see renderings of the tutorial with each push. If you make sure python -i tutorials/stats-sensor-space/76_new_cluster_test_api.py runs cleanly locally each time before you push then it should work on CircleCI now as well!

larsoner avatar Jun 25 '24 16:06 larsoner

@drammock, please review cluster_test function in cluster_level.py before I clean up tutorial

CarinaFo avatar Jul 19 '24 09:07 CarinaFo

@CarinaFo I see most of my comments (and all of @hoechenberger's) marked as "resolved" but I don't see any new commits pushed since those comments were made. Please only mark as resolved if the new code that addresses the comment has been pushed to GitHub (and even then, only if you're sure that what you've changed truly resolves the comment/question).

drammock avatar Jul 22 '24 15:07 drammock

@CarinaFo I was playing around with the cluster test function and I think I came up with a way to completely skip the model matrix / long-form dataframe. This code is incomplete and not fully tested but I think is going in the right direction. Let's discuss on Thursday.

    # validate the input dataframe and return the type of the data column entries
    _dtype = _validate_cluster_df(df, formula)
    # parse formula
    formulaic = _soft_import("formulaic", purpose="parse formula for clustering")
    parser = formulaic.parser.DefaultFormulaParser(include_intercept=False)
    formula = formulaic.Formula(formula, _parser=parser)
    dv_name = str(np.array(formula.lhs.root).item())
    iv_name = str(np.array(formula.rhs.root).item())
    # TODO based on _dtype, decide how to extract the data
    if _dtype is np.ndarray:
        # assume axis order already correct
        func = lambda x: np.concatenate(x.values)
    else:
        # TODO not tested, may need to also triage based on Epochs vs Evoked
        func = lambda x: np.concatenate(y.get_data().swapaxes(-2, -1) for y in x)
    # convert to a list-like X for clustering
    X = df.groupby(iv_name).agg({dv_name: func}).to_list()
    # TODO triage test type
    if len(X) == 1:
        # TODO extract array from outermost list & pass to 1samp test
        pass
    elif len(X) > 2:
        # TODO assume F-test
        pass
    else:
        # TODO triage paired vs unpaired t-test
        pass

drammock avatar Jul 24 '24 23:07 drammock

@CarinaFo I've just pushed a commit to clean up some docstrings / docdict entries so that they're accurate for the new API. There are a couple of TODO comments in there too, related to the max_step and exclude parameters.

Be sure to pull those changes before pushing any more of your work!

drammock avatar Aug 01 '24 17:08 drammock

@kathlin42 this is the PR I was discussing in the office hour today. You can install the WIP new API by doing:

pip install git+https://github.com/drammock/mne-python.git@new_cluster_stats_api_GSOC24

you'll also need pip install formulaic.

Then try creating a DataFrame and passing it to mne.stats.cluster_level.cluster_test()

when you're done testing, you can get back to stable MNE by doing

pip install "mne=1.7"

drammock avatar Aug 16 '24 15:08 drammock

@CarinaFo FYI I needed to do a rebase and force-push in order to get the CIs to run again, as there was a merge conflict. You'll want to recreate your local branch. Ping me if you have questions / challenges (e.g. if you had local work you don't want to lose) and we can work through it together.

drammock avatar Aug 22 '24 20:08 drammock

@CarinaFo, any plans on picking this up again? :) Would be a shame to leave all this effort unmerged and unused!

hoechenberger avatar Mar 30 '25 20:03 hoechenberger

Hi Richard, really want to pick this up again but currently super busy. I hope april will allow me to work on it.

CarinaFo avatar Mar 31 '25 00:03 CarinaFo