FR: For each of the UMAP clusters, information/ID on values (from which columns) assigned to which UMAP clusters would be nice

Open exsell-jc opened this issue 3 years ago • 4 comments

Scenario: when using step_umap() (I guess this applies to other methods such as step_knn() too), the values are assigned to different clusters.

It would be really nice to know the following:

1) Which values from which columns are assigned to each cluster
2) Affinity of columns (e.g. the majority of a column's values) belonging to a particular cluster

For my particular case, 2) is more interesting to me; 1) is more details.

To clarify, consider the following scenario:

- Let's say a vector of 10 columns is reduced to 4 clusters.
- Column X has 90% of its values in cluster A, 8% in cluster B, and 2% in cluster C.
- Column X's assigned cluster = cluster A, i.e. column X's highest affinity = cluster A.
- Similarly for columns Y, Z, etc., for all 10 columns.
- My input = cluster A; output (what I would like to know) = the columns that have the highest affinity to cluster A.

Essentially, the last bullet point is what I would like the new feature to be.
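
The affinity calculation described in the scenario above can be sketched in a few lines. This is only an illustration in plain Python, not anything provided by step_umap(): it assumes each value already carries a cluster label from some separate clustering step, and the function names `column_affinity` and `columns_for_cluster` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def column_affinity(assignments):
    """Map each column to (its most common cluster, share of its values there).

    `assignments` is an iterable of (column, cluster) pairs, one per value.
    """
    counts = defaultdict(Counter)
    for column, cluster in assignments:
        counts[column][cluster] += 1

    affinity = {}
    for column, cluster_counts in counts.items():
        total = sum(cluster_counts.values())
        top_cluster, top_count = cluster_counts.most_common(1)[0]
        affinity[column] = (top_cluster, top_count / total)
    return affinity


def columns_for_cluster(affinity, cluster):
    """Return the columns whose highest-affinity cluster is `cluster`."""
    return sorted(col for col, (top, _) in affinity.items() if top == cluster)


# Toy data (made up): column X mirrors the 90% / 8% / 2% split above.
assignments = (
    [("X", "A")] * 90 + [("X", "B")] * 8 + [("X", "C")] * 2
    + [("Y", "B")] * 60 + [("Y", "A")] * 40
    + [("Z", "A")] * 55 + [("Z", "D")] * 45
)

affinity = column_affinity(assignments)
print(affinity["X"])
print(columns_for_cluster(affinity, "A"))
```

With these toy labels, the query for cluster A returns the columns whose majority of values fall in A, which is the output asked for in the last bullet point.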

Thank you for reading

Apr 21 '22 08:04 exsell-jc

The UMAP algorithm doesn't really have the concept of a cluster the way you are talking about here. You can read more here in the Python docs; notice what color is being mapped to (the label, not any output from UMAP). You need to cluster on top of the UMAP results, like with HDBSCAN in that example.

If you would like to get out cluster assignments from k-means, take a look at this article (you'll want to augment() your original data points).

Apr 25 '22 16:04 juliasilge

Thanks, really nice to know about augment().

Do you happen to know of an alternative that isn't distance-based like KNN? With a larger sample size, KNN would not really work.

Apr 26 '22 11:04 exsell-jc

Have you taken a look at something like mclust? Or this Stack Overflow answer outlines some nice options.

Apr 26 '22 16:04 juliasilge

UMAP is sort of distance based, just on a complex manifold.

I think that the main problem is defining the membership function. With classical clustering methods, we would look at the distance to each class centroid. For UMAP the notions of distance and centroid are not well defined.
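
For reference, the classical membership rule mentioned here (distance to each class centroid) can be sketched as follows. This is a minimal illustration in plain Python that assumes ordinary Euclidean distance on made-up 2-D points; the helper names `centroids` and `nearest_centroid` are hypothetical, and, as noted above, this notion of distance and centroid does not carry over cleanly to UMAP.

```python
import math


def centroids(points, labels):
    """Compute the mean point (centroid) of each labelled group."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(s / counts[label] for s in acc)
        for label, acc in sums.items()
    }


def nearest_centroid(point, cents):
    """Assign `point` to the label whose centroid is closest (Euclidean)."""
    return min(cents, key=lambda label: math.dist(point, cents[label]))


# Toy data (made up): two well-separated groups in 2-D.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
labels = ["A", "A", "B", "B"]

cents = centroids(points, labels)
print(cents)
print(nearest_centroid((1.0, 1.0), cents))
print(nearest_centroid((9.0, 9.0), cents))
```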

Apr 27 '22 17:04 topepo

It might also be worth looking at https://tidyclust.tidymodels.org/, which is our package for dealing with clustering problems.

I'm going to close this issue for now. If you have any further problems/questions/praise, feel free to open another issue!

Mar 07 '23 21:03 EmilHvitfeldt

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

Mar 22 '23 01:03 github-actions[bot]
