document_cluster

Why are distances calculated twice?

Open NaserMonsefi opened this issue 8 years ago • 0 comments

Hi,

Thanks for the great tutorial on document clustering. I am pretty new to text analytics and wanted to ask whether there is a reason that distances are calculated twice for hierarchical document clustering. First here, on `tfidf_matrix`, using cosine distance:

from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)

and a second time here, on `dist`, through the `ward` function, which computes Euclidean distances between the rows of its input before doing the Ward linkage:

linkage_matrix = ward(dist)

Is this done specifically for text clustering?
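To make the question concrete, here is a minimal sketch of what I think is happening. It is not the tutorial's code: `toy_matrix` is a made-up random stand-in for `tfidf_matrix`, and I build the cosine distances with SciPy's `pdist` and `squareform` rather than with `1 - cosine_similarity(...)`, on the assumption that both give the same square distance matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import ward
from scipy.spatial.distance import pdist, squareform

# Hypothetical stand-in for tfidf_matrix: 6 "documents" x 4 "terms".
rng = np.random.default_rng(0)
toy_matrix = rng.random((6, 4))

# Cosine distances in condensed (1-D) form, then as a square matrix.
# The square form plays the role of `dist` in the tutorial.
condensed = pdist(toy_matrix, metric="cosine")
dist = squareform(condensed)

# What the tutorial does: pass the square matrix to ward().
# SciPy treats a 2-D input as observation vectors, so it computes
# Euclidean distances between the rows of `dist` first.
# (Recent SciPy versions may emit a ClusterWarning here.)
linkage_from_square = ward(dist)

# The same thing with the second distance step written out explicitly.
linkage_explicit = ward(pdist(dist, metric="euclidean"))

# The alternative: pass the condensed form, so the cosine distances
# are used as the distances, with no second computation.
linkage_from_condensed = ward(condensed)

print(np.allclose(linkage_from_square, linkage_explicit))
```

If I read the SciPy documentation correctly, Ward linkage is only strictly defined for Euclidean distances, so passing the condensed cosine distances directly is itself a judgment call and not obviously the "right" version either.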

Thanks again

NaserMonsefi, Aug 15 '17 08:08