yellowbrick
Extend the InterclusterDistance visualizer
The InterclusterDistance visualizer is our newest cluster visualization, and while it's been implemented completely, there are still a few updates I'd like to make to it:
- [ ] Ensure it works for a large range of clustering algorithms (and remove skipped tests) -- see below
- [ ] Add custom Principal Coordinate Analysis (PCoA) embedding based on an internal PyLDAViz implementation of Jensen-Shannon Divergence.
- [ ] Add Vanilla PCoA embedding (which may already be in Scikit-Learn)
- [ ] Investigate other scoring mechanisms besides # of instances (as in PyLDAViz for LDA, possibly Silhouette scores, something using y, or cluster diameter) and create a new issue for them or implement them.
- [ ] Allow user to set color of clusters and relative opacity, computing the edge and face color opacities from the specified colors and opacity and setting them correctly as is done with hard coding now.
- [ ] Update the notes section of the visualizer with new embedding and scoring when they are complete!
- [ ] Create documentation example using either sklearn newsgroups corpus or hobbies corpus vectorized as TF-IDF and clustered with LDA, to show topic modeling approach similar to PyLDAViz.
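For reference, the "vanilla" PCoA item in the checklist above can be sketched as classical multidimensional scaling on a precomputed distance matrix. This is only an illustration under that assumption: the function name `pcoa` and its signature are hypothetical, and it is not the embedding currently used by the visualizer or by PyLDAViz.

```python
import numpy as np


def pcoa(distances, n_components=2):
    """Classical PCoA sketch: embed a square distance matrix (hypothetical helper)."""
    D = np.asarray(distances, dtype=float)
    n = D.shape[0]

    # Double-center the squared distances: B = -1/2 * J D^2 J
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J

    # B is symmetric, so eigh applies; keep the largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]

    # Clip small negative eigenvalues that arise from non-Euclidean distances
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale
```

A Jensen-Shannon variant would presumably differ only in how the distance matrix passed to this function is computed.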
Notes on colors
Right now the facecolor of the clusters is hard coded to #2e719344 and the edgecolor is hard coded to #2e719399. Note the 44 and 99 at the end of the respective colors: these set the opacity, and the edge is more opaque than the face so that overlapping clusters stay visible.
I would like to support the user specifying a color for all clusters or a colormap/colors for each cluster as well as the ability to specify the face opacity. If the user specifies these things, then we have to compute the relative alpha (opacity) for both the edge and the face to maintain the currently hardcoded behavior.
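One possible way to compute the relative alpha is to keep the edge-to-face opacity ratio implied by the current defaults (0x99 / 0x44) and cap the edge at fully opaque. The sketch below assumes that interpretation; the helper name `cluster_colors` is hypothetical, and it only handles 6-digit hex colors (named colors and colormaps would need something like matplotlib's color conversion).

```python
# Opacities implied by the hard-coded colors #2e719344 and #2e719399
DEFAULT_FACE_ALPHA = 0x44 / 255
DEFAULT_EDGE_ALPHA = 0x99 / 255


def cluster_colors(color="#2e7193", face_alpha=DEFAULT_FACE_ALPHA):
    """Return (facecolor, edgecolor) as 8-digit hex strings (hypothetical helper)."""
    if not 0.0 <= face_alpha <= 1.0:
        raise ValueError("face_alpha must be between 0 and 1")

    rgb = color.lstrip("#").lower()
    if len(rgb) != 6:
        raise ValueError("expected a 6-digit hex color such as #2e7193")
    int(rgb, 16)  # raises ValueError if the string is not valid hex

    # Assumption: preserve the default edge/face ratio, capped at opaque
    ratio = DEFAULT_EDGE_ALPHA / DEFAULT_FACE_ALPHA
    edge_alpha = min(1.0, face_alpha * ratio)

    def alpha_hex(alpha):
        return format(round(alpha * 255), "02x")

    return "#" + rgb + alpha_hex(face_alpha), "#" + rgb + alpha_hex(edge_alpha)
```

With no arguments this reproduces the two hard-coded values, so the current appearance would remain the default.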
Notes on supported algorithms
Right now we use the cluster_centers_ attribute of the model to embed the centers into two-dimensional space and the labels_ attribute to score/size the clusters. Unfortunately, not all clustering algorithms have these attributes, so we need to extend the cluster_center_ property on the visualizer to either find a different attribute or compute the cluster centers somehow. Below is a listing of various clustering algorithms and their attributes.
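As a rough illustration of the "compute the centers somehow" option, one fallback is to average the training instances assigned to each label when cluster_centers_ is missing. This is a sketch only: `find_cluster_centers` is a hypothetical helper, not the visualizer's code, and it assumes the model exposes labels_ aligned with the rows of X and that -1 marks noise (as in DBSCAN).

```python
from types import SimpleNamespace

import numpy as np


def find_cluster_centers(model, X):
    """Return cluster centers from the model, or label-wise means of X (hypothetical)."""
    centers = getattr(model, "cluster_centers_", None)
    if centers is not None:
        return np.asarray(centers)

    labels = getattr(model, "labels_", None)
    if labels is None:
        raise AttributeError("model has neither cluster_centers_ nor labels_")

    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)

    # Assumption: -1 is a noise label and does not form a cluster
    cluster_ids = np.unique(labels[labels != -1])
    if cluster_ids.size == 0:
        raise ValueError("no clusters found in labels_")

    # One center per cluster: the mean of the instances assigned to it
    return np.vstack([X[labels == k].mean(axis=0) for k in cluster_ids])


# Stand-in for a fitted estimator that only exposes labels_
model = SimpleNamespace(labels_=[0, 0, 1, -1])
X = [[0, 0], [2, 2], [10, 10], [100, 100]]
centers = find_cluster_centers(model, X)
```

This does not cover estimators without labels_ (for example mixture.GaussianMixture or LatentDirichletAllocation in the listing), which would need a different attribute or a call to predict or transform.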
We would like to ensure support for the following clustering algorithms:
AgglomerativeClustering (Ward and Average)
- children_
- labels_
- n_components_
- n_leaves_
Birch
- dummy_leaf_
- fit_
- labels_
- partial_fit_
- root_
- subcluster_centers_
- subcluster_labels_
FeatureAgglomeration
- children_
- labels_
- n_components_
- n_leaves_
decomposition.LatentDirichletAllocation
- bound_
- components_
- doc_topic_prior_
- exp_dirichlet_component_
- n_batch_iter_
- n_iter_
- random_state_
- topic_word_prior_
It would be great if we could support the following clustering algorithms, but it's not clear whether that's possible, because they have no obvious centers or labels:
DBSCAN
- components_
- core_sample_indices_
- labels_
mixture.GaussianMixture
- converged_
- covariances_
- lower_bound_
- means_
- n_iter_
- precisions_
- precisions_cholesky_
- weights_
SpectralClustering
- affinity_matrix_
- labels_
We already have support for the following clustering algorithms (using the cluster_centers_ attribute for embedding and the labels_ attribute for scoring):
AffinityPropagation
- affinity_matrix_
- cluster_centers_
- cluster_centers_indices_
- labels_
- n_iter_
KMeans
- cluster_centers_
- inertia_
- labels_
- n_iter_
MiniBatchKMeans
- cluster_centers_
- counts_
- inertia_
- init_size_
- labels_
- n_iter_
MeanShift
- cluster_centers_
- labels_
Greetings! Can I use it via Anaconda? I can't install version 0.9 of yellowbrick in order to use InterclusterDistance. If I can't, please tell me whether there is another easy way to find the intercluster distance for scikit-learn's k-means. Thanks!
@jaywalkingbackwards we haven’t deployed v0.9 to anaconda yet. It is one of our highest priorities. I am not aware of a different way to find intercluster distance. Have you taken a look at our code? https://github.com/DistrictDataLabs/yellowbrick/blob/develop/yellowbrick/cluster/icdm.py
@bbengfort or @rebeccabilbro any comments?
@jaywalkingbackwards version 0.9 has been released to conda - if you update your Yellowbrick install you should have access to ICDM now!