document_cluster
Unable to get the top n words nearest to the cluster centroid.
Thank you so much for posting such a detailed tutorial!
I am trying to use this to cluster news content. I have 275,449 news articles that I need to cluster. The structure of my data is pretty similar to yours: I have a news content ID and a description (I don't have the ranking concept that you have in your data).
I followed all the steps as per your guide, but when I tried to print the top n words nearest to the cluster centroid, I got a weird output: it printed the same combination of words for every cluster, in a strange format with special characters.
In fact, I tried running this on a very small test dataset with just 10 records, but ended up with the same output.
Cluster 0 words: b'good', b'weather', b'game',
Cluster 0 ContentID: 1, 6,
Cluster 1 words: b'weather', b'good', b'game',
Cluster 1 ContentID: 3, 5, 8, 10,
Cluster 2 words: b'game', b'weather', b'good',
Cluster 2 ContentID: 2, 7,
Cluster 3 words: b'weather', b'good', b'game',
Cluster 3 ContentID: 4, 9,
Could you please help me fix this?
I appreciate your help!
Hi @MaheshwaranK, it's hard to diagnose what's going on without seeing your code and the data format. If you could share those, it would help. Otherwise, could you share a sample of the vocab_frame and order_centroids?
Hello Brandon,
Please find attached the data files (one of them is attached as a Google Drive link) and the code.
- The cosine_similarity step failed with a memory error. (I still went ahead and ran the clustering.)
- It gives only 2 clusters, and the majority of the records end up in one of them. I tried changing max_df and min_df, which gives more clusters, but the majority of the records are still in one cluster.
- I am not able to print the top n words nearest to the cluster centroid. It always gives the same combination of words, with some special characters.
Could you please take a look and help me fix these issues?
I appreciate your help!
Thanks, Mahesh
abc1.txt Code.docx abc2.txt https://drive.google.com/file/d/0B52HiOQsrUMfTkFkLUR4dGVBc0k/view?usp=drive_web
Any suggestions on how to fix the three issues I posted earlier? Thanks!
Can you post the code inline here or upload a notebook to nbviewer? I prefer not to download anything.
Here is the code. It's pretty much the same as the one used in your movie clustering based on reviews.
import numpy as np
import pandas as pd
import nltk
import re
import os
import codecs
from sklearn import feature_extraction
abc1 = []
infile = open('C:/Users/Owner/Desktop/MSBA/Marketing Management/Project/abc1.txt', 'r')
for line in infile:
    abc1.append(line.strip())
infile.close()
abc2 = []
infile = open('C:/Users/Owner/Desktop/MSBA/Marketing Management/Project/abc2.txt', 'r', encoding="Latin-1")
for line in infile:
    abc2.append(line.strip())
infile.close()
print(abc1[:10])
print(abc2[0][:50])
['1000050', '1000058', '1000066', '1000082', '1000094', '1000096', '1000100', '1000132', '100014', '1000140']
Passengers on a Boeing 737 operated by budget airl

stopwords = nltk.corpus.stopwords.words('english')
print(stopwords[:30])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what']
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
Here I define a tokenizer-and-stemmer, which returns the set of stems in the text it is passed, along with a plain tokenizer:
def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems
def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

# not super pythonic, no, not at all.
# use extend so it's a big flat list of vocab
totalvocab_stemmed = []
totalvocab_tokenized = []
for i in abc2:
    allwords_stemmed = tokenize_and_stem(i)  # for each item in 'abc2', tokenize/stem
    totalvocab_stemmed.extend(allwords_stemmed)  # extend the 'totalvocab_stemmed' list

    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index=totalvocab_stemmed)
print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')

there are 7001313 items in vocab_frame
print(vocab_frame.head())

              words
passeng  passengers
on               on
a                 a
boe          boeing
oper       operated
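(Side note for readers: a quick sanity check of the two helpers on a short sample sentence, taken from the abc2 snippet printed above. The exact stems shown in the comments are illustrative and depend on your NLTK data.)

# quick check of tokenize_and_stem vs. tokenize_only; note that "737" is dropped because it contains no letters
sample = "Passengers on a Boeing 737 operated by a budget airline"
print(tokenize_and_stem(sample))
# e.g. ['passeng', 'on', 'a', 'boe', 'oper', 'by', 'a', 'budget', 'airlin']
print(tokenize_only(sample))
# e.g. ['passengers', 'on', 'a', 'boeing', 'operated', 'by', 'a', 'budget', 'airline']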
from sklearn.feature_extraction.text import TfidfVectorizer
# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.80, max_features=200000,
                                   min_df=0.20, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))
%time tfidf_matrix = tfidf_vectorizer.fit_transform(abc2) #fit the vectorizer to synopses
print(tfidf_matrix.shape)

Wall time: 4min 33s
(275449, 1)

terms = tfidf_vectorizer.get_feature_names()
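(One quick diagnostic here, not in the original notebook, just a suggestion: check how many terms the vectorizer actually kept after the max_df/min_df/stop_words filtering.)

print(len(terms))   # number of terms that survived the frequency filters
print(terms[:20])   # peek at a few of them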
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
MemoryError Traceback (most recent call last)
C:\Users\Owner\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in cosine_similarity(X, Y, dense_output)
    887         Y_normalized = normalize(Y, copy=True)
    888
--> 889     K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
    890
    891     return K

C:\Users\Owner\Anaconda3\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
    177     """
    178     if issparse(a) or issparse(b):
--> 179         ret = a * b
    180         if dense_output and hasattr(ret, "toarray"):
    181             ret = ret.toarray()

C:\Users\Owner\Anaconda3\lib\site-packages\scipy\sparse\base.py in __mul__(self, other)
    351             if self.shape[1] != other.shape[0]:
    352                 raise ValueError('dimension mismatch')
--> 353             return self._mul_sparse_matrix(other)
    354
    355         try:

C:\Users\Owner\Anaconda3\lib\site-packages\scipy\sparse\compressed.py in _mul_sparse_matrix(self, other)
    494                                     maxval=nnz)
    495         indptr = np.asarray(indptr, dtype=idx_dtype)
--> 496         indices = np.empty(nnz, dtype=idx_dtype)
    497         data = np.empty(nnz, dtype=upcast(self.dtype, other.dtype))
    498
MemoryError:
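(Side note on the MemoryError: 1 - cosine_similarity(tfidf_matrix) tries to build a 275,449-by-275,449 distance matrix, which is far too large to hold in memory. The distance matrix isn't needed for the k-means step below, so the clustering can proceed without it, as was done here. If distances for a particular document are still wanted, a sketch like the following avoids the full matrix; doc_index is just an illustrative variable.)

from sklearn.metrics.pairwise import cosine_similarity

doc_index = 0  # illustrative: whichever document you want distances for
# one row of the distance matrix, shape (1, n_documents), computed against the sparse tf-idf matrix
dist_row = 1 - cosine_similarity(tfidf_matrix[doc_index], tfidf_matrix)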
from sklearn.cluster import KMeans
num_clusters = 7
km = KMeans(n_clusters=num_clusters)
%time km.fit(tfidf_matrix)
clusters = km.labels_.tolist()

Wall time: 15.1 s

from sklearn.externals import joblib
# uncomment the below to save your model
# since I've already run my model I am loading from the pickle
joblib.dump(km, 'doc_cluster.pkl')
km = joblib.load('doc_cluster.pkl')
clusters = km.labels_.tolist()
abc = { 'ContentID': abc1, 'ContentDesc': abc2, 'Cluster': clusters }
frame = pd.DataFrame(abc, index=[clusters], columns=['ContentID', 'Cluster'])
frame['Cluster'].value_counts()  # number of records per cluster
Out[17]:
1    193418
0     82031
Name: Cluster, dtype: int64
from __future__ import print_function
print("Top terms per cluster:") print() #sort cluster centers by proximity to centroid order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(num_clusters):
    print("Cluster %d words:" % i, end='')

    for ind in order_centroids[i, :3]:  # replace 3 with n words per cluster
        print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=',')
    print()  # add whitespace
    print()  # add whitespace
print()
print()

Top terms per cluster:
Cluster 0 words: b"'s",
Cluster 1 words: b"'s",
Cluster 2 words: b"'s",
Cluster 3 words: b"'s",
Cluster 4 words: b"'s",
Cluster 5 words: b"'s",
Cluster 6 words: b"'s",
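(A note on the b'...' prefixes, separate from the clustering problem below: .encode('utf-8') returns a bytes object in Python 3, and printing bytes adds the b prefix and quotes. Printing the string directly, a minimal tweak to the loop above, gives plain words.)

for i in range(num_clusters):
    print("Cluster %d words:" % i, end='')
    for ind in order_centroids[i, :3]:
        # look up the un-stemmed word as before (.ix is what this notebook uses; newer pandas needs .loc)
        word = vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0]
        print(' %s' % word, end=',')  # print the str directly instead of encoding to bytes
    print()
    print()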
The issue is somewhere with this code:
tfidf_vectorizer = TfidfVectorizer(max_df=0.80, max_features=200000,
min_df=0.20, stop_words='english',
use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))
%time tfidf_matrix = tfidf_vectorizer.fit_transform(abc2) #fit the vectorizer to synopses
print(tfidf_matrix.shape)
Wall time: 4min 33s
(275449, 1)
Notice that the tfidf_matrix shape is (n, 1), where n is the length of abc2. The second dimension should be much higher: it should be the number of terms found within the documents that match your specifications. I'm not sure exactly what's happening here, but that is the problem. It is possible that almost nothing matches your min_df and max_df parameters; you should try lowering min_df and raising max_df.
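As a hedged starting point (the exact values depend on the corpus): with min_df=0.20 and max_df=0.80, a term has to appear in between 20% and 80% of the 275,449 documents, which very few terms are likely to satisfy. A looser configuration might look like this, with the cutoffs below being purely illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer

# looser bounds: keep terms appearing in at least 5 documents and at most 90% of them
tfidf_vectorizer = TfidfVectorizer(max_df=0.90, max_features=200000,
                                   min_df=5, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(abc2)
print(tfidf_matrix.shape)  # the second dimension should now be well above 1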
Did you resolve this? If so, let me know so I can close the issue.