pycrostates
pycrostates copied to clipboard
Add ClusterArray class to import any kind of microstates maps
Hello, this is the new feature I described in Issues, and I open a pull requests here. This is a draft of the thoughts, and not heavily tested.
Closes #117
I think that addition to the API would be very welcome. My initial impression is that it should mimic the ClassArray API from MNE-Python. Examples:
In all 5 cases, the class signature is ClassArray(data, info, optional_arguments). Thus, I would define a ClusterArray class which would ingest fitted data.
class ClusterArray(_BaseCluster):
def __init__(
data: NDArray[float],
info: Union[Info, ChInfo],
cluster_names: Optional[List[str]] = None,
fitted_data: Optional[NDArray[float]] = None,
labels: Optional[NDArray[int]] = None,
):
Where data would be the cluster_centers_ (named data to match MNE-Python's API). The number of clusters n_clusters can be inferred from data. Note that I've set cluster_names, fitted_data, and labels as Optional. I don't think they are key elements that we require to run other methods/functions from the pycrostates API. They are "nice-to-have" variables, but not mandatory. I could be wrong however, I would have to check (especially for labels).
WDYT @RunKZhang ?
@RunKZhang I fixed my proposed class signature in the previous post, I defined it as a function for some reason :see_no_evil:
I agree with you for the ClassArray, and data could be cluster_centers_. But I think labels should be mandatory, since it is essential to calculate gev_.
Sorry I don't get why define it as a function. Since the predict method helps make segmentation of the raw object, and it is inherited from _BaseCluster, I think define it as a class is preferable.
Sorry I don't get why define it as a function. Since the predict method helps make segmentation of the raw object, and it is inherited from _BaseCluster, I think define it as a class is preferable.
Definitely! I just made a mistake in my original comment. prior to edit, where I wrote:
class ClusterArray(
data: NDArray[float],
info: Union[Info, ChInfo],
cluster_names: Optional[List[str]] = None,
fitted_data: Optional[NDArray[float]] = None,
labels: Optional[NDArray[int]] = None,
):
which was obviously wrong!
But I think labels should be mandatory, since it is essential to calculate gev_
OK, then let's put it as third argument, after info.
@vferat Sounds good to you?
Hey @RunKZhang, thanks for your contribution !
In my opinion, fitted_data and labels should be optional, as it can happen that only cluster centers (i.e. microstates maps) are shared from one study to another.
fitted_datacan be defined iflabelsis not defined (for example, if the maps come from a non-clustering algorithm such as ICA, then there are no labels).labelsmust not be defined iffitted_datais not defined.- If both
fitted_dataandlabelsare defined, thengev_should be calculated automatically. - If neither
fitted_datanorlabelsare defined, an specific error should be raise when using clustering metric. - We should add another argument
ignore_polarityto know if the imported maps were computed/are intended to be used with or without polarity (i.e. #93) - We need to make sure that saving in
fifformat takes these different possibilities into account.
What do you think about it ?
Otherwise, the rest of your discussion seems good.
Hello @vferat , I agree with your idea!
As you have mentioned, there are no labels if the cluster_centers are from ICA or other algorithms, and I have to admit that your design of this feature is thoughtful. For the remaining design, I have no objection to them.
Finally, I am grateful my suggestions is taken into consideration. Thank you guys so much!
OK, then let's go with this class signature:
class ClusterArray(_BaseCluster):
def __init__(
data: NDArray[float],
info: Union[Info, ChInfo],
cluster_names: Optional[List[str]] = None,
fitted_data: Optional[NDArray[float]] = None,
labels: Optional[NDArray[int]] = None,
ignore_polarity: bool = False,
):
I think ignore_polarity=False is the sensible default, @vferat?
We touch a sensitive point here:
- We do not support
.predict()forignore_polarity=False(yet?). SoClusterArray(ignore_polarity=False)would be of no use. ignore_polarityis a critical parameter that may completely change the analysis. I would rather ensure the user has set it on purpose, than relying on an optional parameter.
The solution I see is to set ignore_polarity as a mandatory argument, and throw a NotImplementedError error if someone tries to set ClusterArray(ignore_polarity=False) for now.
OK, that works for me.
I picked up this PR, and started with (1) clean-up, (2) input validation on the class, (3) I/O roundtrip to FIFF. No test for now, but that should be a good basis. @vferat Can you have a first look to the main class and to what could be missing. Mostly, could we compute fitting variables from the information provided with this signature, e.g. GEV?
But if we have fitted_data, we should be able to compute the labels? So in the end, we compute the GEV in all cases except if fitted_data is not provided?
But if we have
fitted_data, we should be able to compute thelabels? So in the end, we compute the GEV in all cases except iffitted_datais not provided?
It could be, but there are some cases where the process is not straight forward:
- some clustering algorithms assign sample to a cluster even if the distance between the sample and the choosen cluster center is the smallest ( for example a clustering algorithm using another distance definition, or a clsuter algorithm taking into account other parameters such as sample order..)
- some algorithms can extract maps without labels (i.e. ICA).
These special cases make it difficult to decide whether labels should be automatically calculated. I'm open to discussion, as I have no arguments either for or against.
Sounds like labels should not be auto-calculated.