Alternatives to consensushc for predict from NMFfitX

Open: DaGaMs opened this issue 9 years ago • 1 comment

Hi Renaud,

I'm trying to reproduce a clustering similar to the one described in this paper. I believe that the first part of their algorithm is identical to an NMF run, but their approach to deriving clusters from multiple runs seems different. To quote:

> _Cluster_ A partition-clustering algorithm was applied to the set of matrices S_P to cluster the data into N clusters. A variation of k-means, where each signature for ∀P∈S_P is assigned to exactly one cluster, was used to partition the data. Similarities between mutational signatures were calculated using a cosine similarity (see below) whereas the N centroids were calculated by averaging the signatures belonging to each cluster. The iteration-averaged matrix P was formed by combining the N centroid vectors ordered by their reproducibility (see Step 6). The error bars reported for each mutation type in each signature in P were calculated as the SD of the corresponding mutation type in each centroid over the I iterations. Note that clustering the data in S_P effectively results in clustering S_E as each signature unambiguously corresponds to exactly one exposure, thus allowing derivation of E.
>
> _Evaluate_ The reproducibility of the derived average signatures P is evaluated by examining the tightness and separation of the clusters used to form the centroids in P (see Step 5). More specifically, using cosine similarity, the average silhouette width for each of the N clusters is calculated. An average silhouette width of 1.00 is equivalent to consistently deciphering the same mutational signature, whereas a low silhouette width indicates lack of reproducibility of the solution. The average silhouette width (Rousseeuw, 1987) of the N clusters is used as a measure of reproducibility for the whole solution.

I freely admit that I don't really understand the math behind these two paragraphs, but it seems to me that this is not quite the same as the hierarchical clustering performed in the predict function. Do you think this alternative approach could be implemented in the NMF package in the future? FWIW, the Matlab code for their method is available here and actually somewhat readable.
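
To make the request more concrete, here is a rough Python sketch of one possible reading of the two quoted steps. It is not a port of the paper's Matlab code and uses nothing from the NMF package. All function names (`cosine_similarity`, `match_run`, `cluster_runs`, `silhouette_widths`) are made up for illustration. The main assumption is that "assigned to exactly one cluster" means the N signatures from a single run are matched one-to-one to the N centroids; the matching is done by brute force over permutations, which is only workable for small N. Initialising the centroids from the first run is also an arbitrary choice.

```python
import math
from itertools import permutations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_run(signatures, centroids):
    """Match the signatures of one run one-to-one to the centroids.

    Returns a tuple perm where perm[k] is the index of the signature
    assigned to cluster k (assumed reading of the quoted constraint).
    """
    n = len(centroids)
    return max(
        permutations(range(n)),
        key=lambda perm: sum(
            cosine_similarity(signatures[perm[k]], centroids[k]) for k in range(n)
        ),
    )


def cluster_runs(runs, max_iter=100):
    """Constrained k-means over runs, each run being a list of N signatures."""
    n = len(runs[0])
    centroids = [list(sig) for sig in runs[0]]  # arbitrary initialisation
    assignment = None
    for _ in range(max_iter):
        new_assignment = [match_run(run, centroids) for run in runs]
        if new_assignment == assignment:
            break  # assignments are stable, so the centroids are too
        assignment = new_assignment
        for k in range(n):
            # Centroid k is the element-wise mean of its member signatures.
            members = [run[perm[k]] for run, perm in zip(runs, assignment)]
            centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


def silhouette_widths(runs, assignment):
    """Average silhouette width per cluster, with distance = 1 - cosine similarity."""
    n = len(runs[0])
    clusters = [
        [run[perm[k]] for run, perm in zip(runs, assignment)] for k in range(n)
    ]

    def mean_dist(x, others):
        return sum(1 - cosine_similarity(x, y) for y in others) / len(others)

    widths = []
    for k, members in enumerate(clusters):
        scores = []
        for i, x in enumerate(members):
            rest = members[:i] + members[i + 1:]
            if not rest or n < 2:
                scores.append(0.0)  # silhouette is 0 for a singleton cluster
                continue
            a = mean_dist(x, rest)  # tightness: own cluster
            b = min(mean_dist(x, clusters[j]) for j in range(n) if j != k)
            denom = max(a, b)
            scores.append((b - a) / denom if denom else 0.0)
        widths.append(sum(scores) / len(scores))
    return widths


# Toy data: three runs, two signatures each, listed in a different order per run.
runs = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
    [[0.0, 1.0, 0.9], [0.9, 0.1, 0.0]],
    [[1.0, 0.0, 0.1], [0.1, 1.0, 1.0]],
]
centroids, assignment = cluster_runs(runs)
widths = silhouette_widths(runs, assignment)
```

In this reading, `centroids` plays the role of the iteration-averaged matrix P, and the mean of `widths` is the reproducibility measure for the whole solution. The difference from a consensus-based hierarchical clustering is that the clustering here operates on the signatures (basis components) across runs rather than on the samples.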

Thanks in any case for having provided this fantastic library which I use almost daily. Best wishes,

Ben

DaGaMs, Dec 05 '14 00:12