document_cluster
Why are distances calculated twice?
Hi,
Thanks for the great tutorial on document clustering. I am pretty new to text analytics and wanted to ask whether there is a reason that distances are calculated twice for hierarchical document clustering. First, here on the `tfidf_matrix` using cosine distance:
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
and a second time here, where `dist` is passed to the `ward` function, which computes Euclidean distances between the rows of its input before doing the Ward linkage:
linkage_matrix = ward(dist)
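To show what I mean, here is a minimal sketch. It assumes SciPy's documented behavior that `ward` treats a 2-D input as observation vectors and a 1-D input as a condensed distance matrix. The matrix `X` is a made-up stand-in for `tfidf_matrix`, and the variable names are my own, not from the tutorial.

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import ward
from scipy.spatial.distance import pdist, squareform

# Hypothetical stand-in for tfidf_matrix: 4 documents, 3 terms.
X = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.0, 0.8, 0.6],
])

# Condensed (1-D) cosine distances, and the square form that
# corresponds to dist = 1 - cosine_similarity(X).
cos_condensed = pdist(X, metric="cosine")
dist = squareform(cos_condensed)

# Case 1: pass the square matrix, as in the tutorial.
with warnings.catch_warnings():
    # Some SciPy versions warn when a 2-D input looks like a distance matrix.
    warnings.simplefilter("ignore")
    linkage_square = ward(dist)

# Case 2: explicitly take Euclidean distances between the rows of dist.
linkage_rows = ward(pdist(dist, metric="euclidean"))

# Case 3: pass the cosine distances in condensed form, so they are used as-is.
linkage_condensed = ward(cos_condensed)

# If the square matrix is treated as observations, cases 1 and 2 should agree.
same_as_rows = np.allclose(linkage_square, linkage_rows)
print("ward(dist) matches ward(pdist(dist)):", same_as_rows)
print("first merge height, square input:   ", linkage_square[0, 2])
print("first merge height, condensed input:", linkage_condensed[0, 2])
```

If cases 1 and 2 agree, the square `dist` is being treated as a feature matrix, which is the second distance calculation I am asking about. Case 3 would be the way to cluster on the cosine distances directly, although as far as I understand Ward linkage is strictly defined only for Euclidean distances.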
Is this done specifically for text clustering?
Thanks again