pycrostates icon indicating copy to clipboard operation
pycrostates copied to clipboard

Add ClusterArray class to import any kind of microstates maps

Open RunKZhang opened this issue 2 years ago • 15 comments
trafficstars

Hello, this is the new feature I described in Issues, and I open a pull requests here. This is a draft of the thoughts, and not heavily tested.

Closes #117

RunKZhang avatar Aug 02 '23 12:08 RunKZhang

I think that addition to the API would be very welcome. My initial impression is that it should mimic the ClassArray API from MNE-Python. Examples:

In all 5 cases, the class signature is ClassArray(data, info, optional_arguments). Thus, I would define a ClusterArray class which would ingest fitted data.

class ClusterArray(_BaseCluster):

    def __init__(
        data: NDArray[float], 
        info: Union[Info, ChInfo], 
        cluster_names: Optional[List[str]] = None, 
        fitted_data: Optional[NDArray[float]] = None,
        labels: Optional[NDArray[int]] = None,
    ):

Where data would be the cluster_centers_ (named data to match MNE-Python's API). The number of clusters n_clusters can be inferred from data. Note that I've set cluster_names, fitted_data, and labels as Optional. I don't think they are key elements that we require to run other methods/functions from the pycrostates API. They are "nice-to-have" variables, but not mandatory. I could be wrong however, I would have to check (especially for labels).

WDYT @RunKZhang ?

mscheltienne avatar Aug 02 '23 13:08 mscheltienne

@RunKZhang I fixed my proposed class signature in the previous post, I defined it as a function for some reason :see_no_evil:

mscheltienne avatar Aug 02 '23 13:08 mscheltienne

I agree with you for the ClassArray, and data could be cluster_centers_. But I think labels should be mandatory, since it is essential to calculate gev_.

RunKZhang avatar Aug 03 '23 01:08 RunKZhang

Sorry I don't get why define it as a function. Since the predict method helps make segmentation of the raw object, and it is inherited from _BaseCluster, I think define it as a class is preferable.

RunKZhang avatar Aug 03 '23 01:08 RunKZhang

Sorry I don't get why define it as a function. Since the predict method helps make segmentation of the raw object, and it is inherited from _BaseCluster, I think define it as a class is preferable.

Definitely! I just made a mistake in my original comment. prior to edit, where I wrote:

class ClusterArray(
    data: NDArray[float], 
    info: Union[Info, ChInfo], 
    cluster_names: Optional[List[str]] = None, 
    fitted_data: Optional[NDArray[float]] = None,
    labels: Optional[NDArray[int]] = None,
):

which was obviously wrong!

mscheltienne avatar Aug 03 '23 08:08 mscheltienne

But I think labels should be mandatory, since it is essential to calculate gev_

OK, then let's put it as third argument, after info. @vferat Sounds good to you?

mscheltienne avatar Aug 03 '23 08:08 mscheltienne

Hey @RunKZhang, thanks for your contribution !

In my opinion, fitted_data and labels should be optional, as it can happen that only cluster centers (i.e. microstates maps) are shared from one study to another.

  • fitted_data can be defined if labels is not defined (for example, if the maps come from a non-clustering algorithm such as ICA, then there are no labels).
  • labels must not be defined if fitted_data is not defined.
  • If both fitted_data and labels are defined, then gev_ should be calculated automatically.
  • If neither fitted_data nor labels are defined, an specific error should be raise when using clustering metric.
  • We should add another argument ignore_polarity to know if the imported maps were computed/are intended to be used with or without polarity (i.e. #93)
  • We need to make sure that saving in fif format takes these different possibilities into account.

What do you think about it ?

Otherwise, the rest of your discussion seems good.

vferat avatar Aug 06 '23 13:08 vferat

Hello @vferat , I agree with your idea!

As you have mentioned, there are no labels if the cluster_centers are from ICA or other algorithms, and I have to admit that your design of this feature is thoughtful. For the remaining design, I have no objection to them.

Finally, I am grateful my suggestions is taken into consideration. Thank you guys so much!

RunKZhang avatar Aug 06 '23 13:08 RunKZhang

OK, then let's go with this class signature:

class ClusterArray(_BaseCluster):

    def __init__(
        data: NDArray[float], 
        info: Union[Info, ChInfo], 
        cluster_names: Optional[List[str]] = None, 
        fitted_data: Optional[NDArray[float]] = None,
        labels: Optional[NDArray[int]] = None,
        ignore_polarity: bool = False,
    ):

I think ignore_polarity=False is the sensible default, @vferat?

mscheltienne avatar Aug 07 '23 08:08 mscheltienne

We touch a sensitive point here:

  • We do not support .predict() for ignore_polarity=False (yet?). So ClusterArray(ignore_polarity=False) would be of no use.
  • ignore_polarity is a critical parameter that may completely change the analysis. I would rather ensure the user has set it on purpose, than relying on an optional parameter.

The solution I see is to set ignore_polarity as a mandatory argument, and throw a NotImplementedError error if someone tries to set ClusterArray(ignore_polarity=False) for now.

vferat avatar Aug 08 '23 12:08 vferat

OK, that works for me.

mscheltienne avatar Aug 08 '23 12:08 mscheltienne

I picked up this PR, and started with (1) clean-up, (2) input validation on the class, (3) I/O roundtrip to FIFF. No test for now, but that should be a good basis. @vferat Can you have a first look to the main class and to what could be missing. Mostly, could we compute fitting variables from the information provided with this signature, e.g. GEV?

mscheltienne avatar Mar 21 '24 14:03 mscheltienne

But if we have fitted_data, we should be able to compute the labels? So in the end, we compute the GEV in all cases except if fitted_data is not provided?

mscheltienne avatar Mar 23 '24 11:03 mscheltienne

But if we have fitted_data, we should be able to compute the labels? So in the end, we compute the GEV in all cases except if fitted_data is not provided?

It could be, but there are some cases where the process is not straight forward:

  • some clustering algorithms assign sample to a cluster even if the distance between the sample and the choosen cluster center is the smallest ( for example a clustering algorithm using another distance definition, or a clsuter algorithm taking into account other parameters such as sample order..)
  • some algorithms can extract maps without labels (i.e. ICA).

These special cases make it difficult to decide whether labels should be automatically calculated. I'm open to discussion, as I have no arguments either for or against.

vferat avatar Mar 27 '24 19:03 vferat

Sounds like labels should not be auto-calculated.

mscheltienne avatar Mar 27 '24 19:03 mscheltienne