Attractors.jl
Optimize feature selection for clustering (or in general finding attractors)
I have an idea of how we could make a function to optimize feature selection for clustering attractors. I will be creating this function while working on a new research project. But I think it is useful to discuss possibilities of how to make it happen.
My idea is as follows:
- We have some "function" that returns a lot of features, like 20 or so.
- From these features we want to find the 3 or 4 that are most effective at clustering the data into separated attractors.
- We could perform an optimization loop: pick 3 features at random, cluster the data with these features, and then compute a clustering quality metric. Keep the three features that yield the highest quality.
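A minimal sketch of that loop, assuming a hypothetical `cluster_quality` function (a placeholder for whatever clustering + metric pipeline we settle on) that takes the selected feature columns and returns a scalar score, higher = better:

```julia
using Random

# features: n_samples × n_features matrix.
# cluster_quality: hypothetical placeholder; clusters on the given
# columns and returns a scalar quality score (higher = better).
function best_random_subset(features, cluster_quality; k = 3, trials = 100)
    n = size(features, 2)
    best_idxs, best_q = nothing, -Inf
    for _ in 1:trials
        idxs = randperm(n)[1:k]                 # pick k features at random
        q = cluster_quality(features[:, idxs])  # cluster + score this subset
        if q > best_q
            best_q, best_idxs = q, idxs
        end
    end
    return best_idxs, best_q
end
```

This is pure random search; an exhaustive enumeration over all subsets would replace the `randperm` line.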
The main question therefore is how do we define a "cluster quality metric"? On one hand, this metric needs to have a contribution from cluster separation: the more separated the clusters are, the clearer the clustering. This could be done by computing the pairwise distribution distances between the pdfs fitted to every cluster, and keeping the median of these distances as the metric.
However, this cluster quality metric should also have some sort of contribution from the number of clusters found. The more clusters the better...?
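To make the separation idea concrete, here is a crude single-feature sketch: summarize each cluster by mean and spread instead of a full fitted pdf, and take the median pairwise distance in units of pooled spread. The pdf-based distance described above (e.g. a Wasserstein or KL distance between fitted densities) would slot in where the mean/std distance is used here:

```julia
using Statistics

# Median pairwise "distance" between clusters along one feature.
# Each cluster is summarized by (mean, std); the distance between two
# clusters is the gap between their means relative to their combined
# spread. A stand-in for a proper distribution distance between pdfs.
function separation_score(feature::Vector, labels::Vector{Int})
    clusters = unique(labels)
    stats = [(mean(feature[labels .== c]), std(feature[labels .== c]))
             for c in clusters]
    dists = Float64[]
    for i in 1:length(stats), j in i+1:length(stats)
        (μi, σi), (μj, σj) = stats[i], stats[j]
        push!(dists, abs(μi - μj) / (σi + σj + eps()))
    end
    return isempty(dists) ? 0.0 : median(dists)
end
```

A cluster-count contribution could then be multiplied or added on top of this separation term.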
cc also @KalelR I think this is relevant for you too.
Hey, great! I think it's a cool project, which can extend to finding relevant features in general for time-series.
Our experience from using DBSCAN to find attractors is that it can be quite unclear when the clustering works. We mainly had to rely on the silhouette index, which is also far from perfect, and quite unintuitive. Another complication is that it would be very expensive to find the most efficient set of features from the clustering by doing this sampling from lots of features. So maybe it would be nice to come up with a technique that doesn't rely on clustering like DBSCAN.
In the systems I've been studying, it hasn't been tremendously difficult to find nice features that are constant, or at least stationary, on the attractor, so that their time-averages were well-behaved, and ics on the same attractor were basically the same. In this scenario, feature space is composed of very densely populated clouds (almost single points) that are all well separated between themselves, and a simple histogram (or the similar idea I did in PR #84) works nicely. Maybe a similar idea could work here, and one could look at the histograms on individual features and from there decide the best ones. Not sure what the best metric would be, though. Also not sure in general how robust this would be to noisy data.
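One cheap way to act on this histogram idea: count how many bins of a per-feature histogram are actually occupied. Features whose values concentrate in a few well-separated bins are candidates for clean clustering. This is only an illustrative sketch (bin count and the criterion itself are assumptions, not what PR #84 does exactly):

```julia
# Count occupied bins of a 1-D histogram of one feature.
# A feature whose samples fall into very few bins (relative to the
# number of samples) concentrates into dense, nearly point-like clouds.
function occupied_bins(feature::Vector; nbins = 50)
    lo, hi = extrema(feature)
    hi == lo && return 1                  # constant feature: one "cloud"
    counts = zeros(Int, nbins)
    for x in feature
        b = clamp(Int(fld((x - lo) / (hi - lo) * nbins, 1)) + 1, 1, nbins)
        counts[b] += 1
    end
    return count(>(0), counts)
end
```

Noisy data would smear the clouds across more bins, which is exactly the robustness concern raised above.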
> The main question therefore is how do we define a "cluster quality metric"?
I wonder how well the clustering would work if applied on all the available features simultaneously, i.e. on this high-dimensional feature space. If it works well (the histogram would, I guess), then it could be taken as a ground truth to then compare the single-feature analysis to.
Some additional thoughts:
Be aware of features spanning different orders of magnitude, especially if you are computing distances. Features in the interval (0, 1e3) will over-contribute compared to features in (0, 1e-1), for instance. A problem we haven't yet solved is how to deal with this properly; the current rescale_to_01 isn't very good (cf. PR #59).
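For reference, a minimal per-feature min–max rescale to [0, 1] looks like the following (a sketch of this style of normalization, not the actual rescale_to_01 source). Its weakness is visible immediately: a single outlier in a column compresses all the remaining values into a narrow band.

```julia
# Rescale each column (feature) of a sample × feature matrix to [0, 1]
# via min–max normalization. Sensitive to outliers: one extreme value
# per column dominates the denominator.
function rescale01(features::Matrix)
    lo = minimum(features; dims = 1)
    hi = maximum(features; dims = 1)
    return (features .- lo) ./ max.(hi .- lo, eps())
end
```

A robust variant would use quantiles instead of the raw extrema.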
A useful property of the features that could enter the metric is if they are constant, or at least stationary in time.
Would be happy to follow this and discuss more at some point!
PS: sorry for the delay, had a trip this week.
> From these features we want to find the 3 or 4 that are most effective at clustering the data into separated attractors.
I guess you want to use some integer-programming technique. The setup is as follows: for 20 features you have a vector of 20 values in {0,1}, where 1 means "use this feature" and 0 "discard it". The program searches for the vector in {0,1}^20 that optimizes the fitness function (the clustering metric); it will select the features that optimize the metric.
The hard stuff is to define the right clustering quality metric.
> Another complication is that it would be very expensive to find the most efficient set of features from the clustering by doing this sampling from lots of features. So maybe it would be nice to come up with a technique that doesn't rely on clustering like DBSCAN.
The optimization to select the best features can be done with DBSCAN on a small random sample of vectors. The optimization shouldn't take long if the problem is convex; if it is not convex, that is the worst-case scenario and you may end up with a suboptimal solution.
> Another complication is that it would be very expensive to find the most efficient set of features from the clustering by doing this sampling from lots of features.
In my application this is not a problem. I have about 100 trajectories in total.
> The current rescale_to_01 isn't very good (cf. PR https://github.com/JuliaDynamics/Attractors.jl/pull/59).
What do you mean? Why?
> Maybe a similar idea could work here, and one could look at the histograms on individual features and from there decide the best ones. Not sure what the best metric would be, though. Also not sure in general how robust this would be to noisy data. I wonder how well the clustering would work if applied on all the available features simultaneously, i.e. on this high-dimensional feature space. If it works well (the histogram would, I guess), then it could be taken as a ground truth to then compare the single-feature analysis to.
But I haven't really understood from your answer how to judge the quality of a clustering, so there is no way to compare whether some features are good or not. You say "compare" above, but how? Technically, how? As I said in my original post, I need a way to numerically estimate the "goodness" of a clustering output. I guess I'll open a post on Discourse and see if someone from statistics has an idea of how to do this. Alex put it very well:
> The hard stuff is to define the right clustering quality metric.
> I guess you want to use some integer-programming technique. The setup is as follows: for 20 features you have a vector of 20 values in {0,1}, where 1 means "use this feature" and 0 "discard it". The program searches for the vector in {0,1}^20 that optimizes the fitness function (the clustering metric); it will select the features that optimize the metric.
Yes, but this is the easy part. I can just use Combinatorics.jl to make an iterator for me.
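Indeed, the enumeration is a one-liner with Combinatorics.jl; for 20 features and subsets of size 3 there are binomial(20, 3) = 1140 candidates, which is why the quality metric has to be cheap to evaluate:

```julia
using Combinatorics

# Lazily enumerate all 3-element index subsets of 20 features;
# `combinations` yields each subset exactly once.
feature_subsets = combinations(1:20, 3)

n_candidates = binomial(20, 3)  # 1140 subsets to score
```

Each yielded subset is a vector of column indices, ready to index into the feature matrix.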
ref. https://discourse.julialang.org/t/real-number-metric-for-quantifying-the-quality-of-a-clustering-output/102385
@KalelR apparently the metric I am searching for is just the silhouette (see the post above). But I remember that here we optimize differently using Optim.jl.
So I guess the numerical quantity I am searching for is just the minimum of the optimizer here: https://github.com/JuliaDynamics/Attractors.jl/blob/main/src/mapping/grouping/cluster_optimal_%CF%B5.jl#L116
So I guess I should just do a simple PR that makes it so that this information is stored somewhere in the GroupViaClustering struct.
I did some analysis here. I am keeping track of the -Optim.minimum(opt) from this line: https://github.com/JuliaDynamics/Attractors.jl/blob/main/src/mapping/grouping/cluster_optimal_%CF%B5.jl#L116.
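For reference, a minimal sketch of what this extraction looks like with Optim.jl, using a stand-in parabola as the objective (the real objective at the linked line clusters at radius ε and returns minus the average silhouette, so negating the minimum recovers the best silhouette value):

```julia
using Optim

# Stand-in objective: the real one returns minus the mean silhouette of
# the clustering obtained at radius ε, so minimizing it maximizes the
# silhouette. A parabola plays that role here.
f(ε) = (ε - 0.3)^2 - 0.9

opt = Optim.optimize(f, 0.0, 1.0)  # univariate optimization on [0, 1]
quality = -Optim.minimum(opt)      # the "optimal value" tracked below
```

The PR would simply stash `quality` in the grouping struct so users can inspect it afterwards.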
It appears to me that just this by itself is not enough. Here are some examples from the data I am using:
(Left: the timeseries. The features are always the standard deviation of each timeseries for now. The dataset has 3 attractors; we know this a priori. We want to see which observables (i.e., "features") separate the attractors best.)
(figure) optimal val = 0.84
(figure) optimal val = 0.87
(figure) optimal val = 0.95
This image is pretty interesting because it shows that only one feature is sufficient to cluster this data into 3 attractors (you will notice that these 3 features are almost perfectly linearly correlated).
My results are that the optimal silhouette value indeed correlates with "cluster separation quality"; however, this metric by itself is probably not enough. E.g., in the following example I have the lowest optimal value, yet still find 3 clusters:
(figure) optimal value = 0.8
My main difficulty is that the silhouette quality metric has so little variability in it. I need to process this. Ideally I want a metric that rewards both cluster quality and cluster count, and that also rewards using fewer feature dimensions. I want to test whether just using the formula "optimal value * number of attractors / number of feature dimensions" is enough.
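A sketch of that candidate formula, just to make the proposal concrete (the example numbers reuse values from the figures above, but the pairing with feature counts is illustrative; whether this formula balances the three contributions sensibly is exactly what remains to be tested):

```julia
# Reward silhouette quality and the number of clusters found; penalize
# the number of feature dimensions used.
composite_score(optimal_val, n_clusters, n_features) =
    optimal_val * n_clusters / n_features

composite_score(0.95, 3, 1)  # ≈ 2.85: good clustering from one feature
composite_score(0.84, 3, 3)  # ≈ 0.84: same cluster count, three features
```

One obvious caveat: because the silhouette barely varies, this score is dominated by the cluster-count and dimension terms.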