new cluster test API
Set up a new `cluster_test` API that builds the design matrix from a Wilkinson formula (using the `formulaic` package).
- Currently unable to continue working on this project due to moving continents (soon based in Australia); I expect to be back to normal by the end of November.
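As a rough illustration of what Wilkinson-formula support implies, the sketch below uses `pd.get_dummies` as a stand-in for the design-matrix expansion that formulaic would perform for a formula like `data ~ condition`. The column names and data are invented for illustration only.

```python
import pandas as pd

# Hypothetical input: one row per evoked, one categorical predictor.
df = pd.DataFrame(
    {
        "subject": ["s01", "s01", "s02", "s02"],
        "condition": ["faces", "cars", "faces", "cars"],
    }
)

# A formula like "data ~ condition" implies a design matrix with one
# dummy-coded column per condition level; get_dummies is a stand-in
# here for what formulaic produces from the formula string.
design = pd.get_dummies(df["condition"], dtype=float)
print(design.columns.tolist())
```

In the actual PR this expansion is delegated to formulaic, which is why the container passed in needs to be (or become) a DataFrame.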
This might be related:
https://github.com/mne-tools/mne-incubator/issues/31
cc @SophieHerbst
I think it is a great idea to revamp the current cluster permutation test API 🚀! Could you please share the reasoning behind choosing a pandas DataFrame as the container for evokeds and related meta info? If possible, I'd try to avoid using pandas (an optional dependency) when something similar could be achieved using, for example, a simple dictionary.
@cbrnr I think the main reason for a dataframe instead of a dictionary is that formulaic only allows for dataframes as input and we want to include Wilkinson formula support in the new cluster_test API.
But this doesn't need to be user-facing, then, no? Just trying to understand. If a user passes in a list of TypedDicts or Dataclasses, you can internally create the DataFrames that need to be passed to formulaic. The user would then also get tab-completion assistance in their editor. But it's just a thought. Great work so far in any case!
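A minimal sketch of what that dict-based layer could look like. Everything here is hypothetical (`EvokedRecord`, its field names, and `records_to_frame` are invented for illustration, not part of MNE): the user passes plain typed dicts and the DataFrame is only constructed internally, right before it is handed to formulaic.

```python
from typing import TypedDict

import pandas as pd


class EvokedRecord(TypedDict):
    # Hypothetical record type; field names are illustrative only.
    subject: str
    condition: str
    data: object  # would be an mne.Evoked in practice


def records_to_frame(records: list[EvokedRecord]) -> pd.DataFrame:
    # The user-facing API takes plain dicts (with editor tab-completion
    # via TypedDict); pandas is used only internally.
    return pd.DataFrame(records)


records: list[EvokedRecord] = [
    {"subject": "s01", "condition": "faces", "data": None},
    {"subject": "s01", "condition": "cars", "data": None},
]
df = records_to_frame(records)
print(df.columns.tolist())
```

Note that pandas is still imported internally here, so this pattern hides the dependency from the user-facing signature rather than removing it from the requirements.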
> you can internally create the DataFrames that need to be passed to formulaic
This would still require pandas to be available though.
> > you can internally create the DataFrames that need to be passed to formulaic
>
> This would still require pandas to be available though.

Sure, but it would provide a potentially more user-friendly API.
For context, this is step 1 of a GSoC project. A later step involves (probably) creating helper functions that will create the necessary DataFrame for the user. That's not done here because:
- there are probably complicated use cases / designs that we won't foresee, so we want it to be possible for the user to pass in their own custom dataframe instead of our internally-generated one.
- we're actually not sure how helpful the helper function will be (at least in some cases): if what you need to pass in is a set of matched lists of subject IDs, evoked objects, and condition names, well then you might as well just call
`DataFrame(dict(subj=subj_list, cond=cond_list, data=evk_list))` yourself instead of calling the helper function. A helper function makes much more sense in other cases, like when dealing with Epochs objects where the conditions are intermixed within one object.
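The one-liner mentioned in the last bullet, spelled out as a runnable sketch (toy arrays stand in for the Evoked objects; names and shapes are invented):

```python
import numpy as np
import pandas as pd

# Matched lists, as described above: one subject ID, one condition
# label, and one data object per observation.
subj_list = ["s01", "s02", "s03"]
cond_list = ["faces", "faces", "cars"]
evk_list = [np.zeros((2, 5)), np.ones((2, 5)), np.ones((2, 5))]

# No helper function needed when the lists are already matched
# element-wise -- one row per evoked.
df = pd.DataFrame(dict(subj=subj_list, cond=cond_list, data=evk_list))
print(df.shape)
```

This is the case where a helper adds little value; the Epochs case (conditions intermixed in one object) is where a helper would earn its keep.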
@CarinaFo I pushed a commit to add the dataset and add formulaic to our full and doc dependencies so that eventually CIs can use them properly. I added the tags [skip azp] [skip actions] to skip running those CIs. Feel free to git pull the changes back to your local machine!
If you look at the CI runs, you can see that CircleCI, which builds modified examples/tutorials in PRs, hit an error on 9c8ec90:
```
sphinx.errors.ExtensionError: Could not find docstring in file "/home/circleci/project/tutorials/stats-sensor-space/76_new_cluster_test_api.py". A docstring is required by sphinx-gallery unless the file is ignored by "ignore_pattern"
```
Then I pushed a little commit to fix that in 47363b5, and now it hits a different error (which replicates what I saw locally when I tried to run the example):
```
../tutorials/stats-sensor-space/76_new_cluster_test_api.py unexpectedly failed to execute correctly:

Traceback (most recent call last):
  File "/home/circleci/project/tutorials/stats-sensor-space/76_new_cluster_test_api.py", line 449, in <module>
    df_long = convert_wide_to_long(df)
  File "/home/circleci/project/tutorials/stats-sensor-space/76_new_cluster_test_api.py", line 431, in convert_wide_to_long
    data_2d = row["data"]
  File "/home/circleci/python_env/lib/python3.10/site-packages/pandas/core/series.py", line 1121, in __getitem__
    return self._get_value(key)
  File "/home/circleci/python_env/lib/python3.10/site-packages/pandas/core/series.py", line 1237, in _get_value
    loc = self.index.get_loc(label)
  File "/home/circleci/python_env/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'data'
```
It would be good to keep CircleCI green at least so we can see renderings of the tutorial with each push. If you make sure `python -i tutorials/stats-sensor-space/76_new_cluster_test_api.py` runs cleanly locally each time before you push, then it should work on CircleCI now as well!
@drammock, please review the `cluster_test` function in `cluster_level.py` before I clean up the tutorial.
@CarinaFo I see most of my comments (and all of @hoechenberger's) marked as "resolved" but I don't see any new commits pushed since those comments were made. Please only mark as resolved if the new code that addresses the comment has been pushed to GitHub (and even then, only if you're sure that what you've changed truly resolves the comment/question).
@CarinaFo I was playing around with the cluster test function and I think I came up with a way to completely skip the model matrix / long-form dataframe. This code is incomplete and not fully tested but I think is going in the right direction. Let's discuss on Thursday.
```python
# validate the input dataframe and return the type of the data column entries
_dtype = _validate_cluster_df(df, formula)
# parse formula
formulaic = _soft_import("formulaic", purpose="parse formula for clustering")
parser = formulaic.parser.DefaultFormulaParser(include_intercept=False)
formula = formulaic.Formula(formula, _parser=parser)
dv_name = str(np.array(formula.lhs.root).item())
iv_name = str(np.array(formula.rhs.root).item())
# TODO based on _dtype, decide how to extract the data
if _dtype is np.ndarray:
    # assume axis order already correct
    func = lambda x: np.concatenate(x.values)
else:
    # TODO not tested, may need to also triage based on Epochs vs Evoked
    func = lambda x: np.concatenate([y.get_data().swapaxes(-2, -1) for y in x])
# convert to a list-like X for clustering
X = df.groupby(iv_name).agg({dv_name: func})[dv_name].to_list()
# TODO triage test type
if len(X) == 1:
    # TODO extract array from outermost list & pass to 1samp test
    pass
elif len(X) > 2:
    # TODO assume F-test
    pass
else:
    # TODO triage paired vs unpaired t-test
    pass
```
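To make the groupby step above concrete, here is a tiny runnable version on toy data, using an explicit loop in place of `.agg` and `np.stack` in place of `np.concatenate` (each cell here holds a single observation). All column names and array shapes are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the df in the sketch: each "data" cell is one
# subject's (n_times, n_channels) array for a given condition.
df = pd.DataFrame(
    {
        "cond": ["a", "a", "b", "b"],
        "data": [
            np.zeros((3, 2)),
            np.ones((3, 2)),
            np.full((3, 2), 2.0),
            np.full((3, 2), 3.0),
        ],
    }
)

# Stack each condition's arrays into (n_subjects, n_times, n_channels)
# and collect one array per condition into the list-like X expected by
# the clustering functions.
X = [np.stack(group["data"].to_list()) for _, group in df.groupby("cond")]
print(len(X), X[0].shape)
```

With two groups this yields the two-sample branch of the triage above; one group would route to the one-sample test and three or more to the F-test.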
@CarinaFo I've just pushed a commit to clean up some docstrings / docdict entries so that they're accurate for the new API. There are a couple of TODO comments in there too, related to the `max_step` and `exclude` parameters.
Be sure to pull those changes before pushing any more of your work!
@kathlin42 this is the PR I was discussing in the office hour today. You can install the WIP new API by doing:

```
pip install git+https://github.com/drammock/mne-python.git@new_cluster_stats_api_GSOC24
```

You'll also need `pip install formulaic`. Then try creating a DataFrame and passing it to `mne.stats.cluster_level.cluster_test()`. When you're done testing, you can get back to stable MNE by doing:

```
pip install "mne==1.7"
```
@CarinaFo FYI I needed to do a rebase and force-push in order to get the CIs to run again, as there was a merge conflict. You'll want to recreate your local branch. Ping me if you have questions / challenges (e.g. if you had local work you don't want to lose) and we can work through it together.
@CarinaFo, any plans on picking this up again? :) Would be a shame to leave all this effort unmerged and unused!
Hi Richard, I really want to pick this up again, but I'm currently super busy. I hope April will allow me to work on it.