document_cluster
Unable to get the top n words nearest to the cluster centroid.
Thank you so much for posting such a detailed tutorial!
I am trying to use this to cluster news content. I have 275,449 news articles that I need to cluster. The structure of my data is pretty similar to yours: I have a news content ID and a description (I don't have the ranking concept that you have in your data).
I followed all the steps as per your guide, but when I tried to print the top n words nearest to the cluster centroid, I got a weird output: it printed the same combination of words for every cluster, in a strange format with special characters.
In fact, I tried running this on a very small test dataset with just 10 records, but ended up with the same output.
Cluster 0 words: b'good', b'weather', b'game',
Cluster 0 ContentID: 1, 6,
Cluster 1 words: b'weather', b'good', b'game',
Cluster 1 ContentID: 3, 5, 8, 10,
Cluster 2 words: b'game', b'weather', b'good',
Cluster 2 ContentID: 2, 7,
Cluster 3 words: b'weather', b'good', b'game',
Cluster 3 ContentID: 4, 9,
Could you please help me fix this?
I appreciate your help!
Hi @MaheshwaranK, it's hard to diagnose what's going on without seeing your code and the data format. If you could share those, it would help. Otherwise, could you share a sample of the vocab_frame and order_centroids?
Hello Brandon,
Please find attached the data files (one of them is attached as a Google Drive link) and the code.
- The cosine_similarity step failed with a memory error. (I still went ahead and ran the clustering.)
- It gives only 2 clusters, and the majority of the records end up in one of them. I tried changing max_df and min_df, which gives more clusters, but the majority of the records are still in one cluster.
- I am not able to print the top n words nearest to the cluster centroid. It always gives the same combination of words, with some special characters.
Could you please take a look and help me fix these issues?
I appreciate your help!
Thanks, Mahesh
abc1.txt Code.docx abc2.txt https://drive.google.com/file/d/0B52HiOQsrUMfTkFkLUR4dGVBc0k/view?usp=drive_web
Any suggestions on how to fix the three issues I posted earlier? Thanks!
Can you post the code inline here or upload a notebook to nbviewer? I prefer not to download anything.
Here is the code. It's pretty much the same as the one used in your movie clustering based on reviews.
import numpy as np
import pandas as pd
import nltk
import re
import os
import codecs
from sklearn import feature_extraction
abc1 = []
infile = open('C:/Users/Owner/Desktop/MSBA/Marketing Management/Project/abc1.txt', 'r')
for line in infile:
    abc1.append(line.strip())
infile.close()
abc2 = []
infile = open('C:/Users/Owner/Desktop/MSBA/Marketing Management/Project/abc2.txt', 'r', encoding="Latin-1")
for line in infile:
    abc2.append(line.strip())
infile.close()
print(abc1[:10])
print(abc2[0][:50])
['1000050', '1000058', '1000066', '1000082', '1000094', '1000096', '1000100', '1000132', '100014', '1000140']
Passengers on a Boeing 737 operated by budget airl

stopwords = nltk.corpus.stopwords.words('english')
print(stopwords[:30])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what']
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
Here I define a tokenizer-and-stemmer, which returns the set of stems in the text it is passed, along with a plain tokenizer:
def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems
def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

# not super pythonic, no, not at all.
# use extend so it's a big flat list of vocab
totalvocab_stemmed = []
totalvocab_tokenized = []
for i in abc2:
    allwords_stemmed = tokenize_and_stem(i)  # for each item in 'abc2', tokenize/stem
    totalvocab_stemmed.extend(allwords_stemmed)  # extend the 'totalvocab_stemmed' list

    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index=totalvocab_stemmed)
print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')

there are 7001313 items in vocab_frame
print(vocab_frame.head())

              words
passeng  passengers
on               on
a                 a
boe          boeing
oper       operated
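(Side note for readers: a quick sanity check of the two helpers on a short sample sentence, taken from the abc2 snippet printed above. The exact stems shown in the comments are illustrative and depend on your NLTK data.)

# quick check of tokenize_and_stem vs. tokenize_only; note that "737" is dropped because it contains no letters
sample = "Passengers on a Boeing 737 operated by a budget airline"
print(tokenize_and_stem(sample))
# e.g. ['passeng', 'on', 'a', 'boe', 'oper', 'by', 'a', 'budget', 'airlin']
print(tokenize_only(sample))
# e.g. ['passengers', 'on', 'a', 'boeing', 'operated', 'by', 'a', 'budget', 'airline']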
from sklearn.feature_extraction.text import TfidfVectorizer
# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.80, max_features=200000,
                                   min_df=0.20, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))
%time tfidf_matrix = tfidf_vectorizer.fit_transform(abc2) #fit the vectorizer to synopses
print(tfidf_matrix.shape)

Wall time: 4min 33s
(275449, 1)

terms = tfidf_vectorizer.get_feature_names()
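(One quick diagnostic here, not in the original notebook, just a suggestion: check how many terms the vectorizer actually kept after the max_df/min_df/stop_words filtering.)

print(len(terms))   # number of terms that survived the frequency filters
print(terms[:20])   # peek at a few of them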
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
MemoryError Traceback (most recent call last)
C:\Users\Owner\Anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in cosine_similarity(X, Y, dense_output)
    887         Y_normalized = normalize(Y, copy=True)
    888
--> 889     K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
    890
    891     return K

C:\Users\Owner\Anaconda3\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
    177     """
    178     if issparse(a) or issparse(b):
--> 179         ret = a * b
    180         if dense_output and hasattr(ret, "toarray"):
    181             ret = ret.toarray()

C:\Users\Owner\Anaconda3\lib\site-packages\scipy\sparse\base.py in __mul__(self, other)
    351             if self.shape[1] != other.shape[0]:
    352                 raise ValueError('dimension mismatch')
--> 353             return self._mul_sparse_matrix(other)
    354
    355         try:

C:\Users\Owner\Anaconda3\lib\site-packages\scipy\sparse\compressed.py in _mul_sparse_matrix(self, other)
    494                                     maxval=nnz)
    495         indptr = np.asarray(indptr, dtype=idx_dtype)
--> 496         indices = np.empty(nnz, dtype=idx_dtype)
    497         data = np.empty(nnz, dtype=upcast(self.dtype, other.dtype))
    498
MemoryError:
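(Side note on the MemoryError: 1 - cosine_similarity(tfidf_matrix) tries to build a 275,449-by-275,449 distance matrix, which is far too large to hold in memory. The distance matrix isn't needed for the k-means step below, so the clustering can proceed without it, as was done here. If distances for a particular document are still wanted, a sketch like the following avoids the full matrix; doc_index is just an illustrative variable.)

from sklearn.metrics.pairwise import cosine_similarity

doc_index = 0  # illustrative: whichever document you want distances for
# one row of the distance matrix, shape (1, n_documents), computed against the sparse tf-idf matrix
dist_row = 1 - cosine_similarity(tfidf_matrix[doc_index], tfidf_matrix)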
from sklearn.cluster import KMeans
num_clusters = 7
km = KMeans(n_clusters=num_clusters)
%time km.fit(tfidf_matrix)
clusters = km.labels_.tolist()

Wall time: 15.1 s

from sklearn.externals import joblib
# uncomment the below to save your model
# since I've already run my model I am loading from the pickle
joblib.dump(km, 'doc_cluster.pkl')
km = joblib.load('doc_cluster.pkl')
clusters = km.labels_.tolist()
abc = { 'ContentID': abc1, 'ContentDesc': abc2, 'Cluster': clusters }
frame = pd.DataFrame(abc, index=[clusters], columns=['ContentID', 'Cluster'])
frame['Cluster'].value_counts()  # number of records per cluster
Out[17]:
1    193418
0     82031
Name: Cluster, dtype: int64
from __future__ import print_function
print("Top terms per cluster:") print() #sort cluster centers by proximity to centroid order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(num_clusters):
    print("Cluster %d words:" % i, end='')

    for ind in order_centroids[i, :3]:  # replace 3 with n words per cluster
        print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=',')
    print()  # add whitespace
    print()  # add whitespace
print()
print()

Top terms per cluster:
Cluster 0 words: b"'s",
Cluster 1 words: b"'s",
Cluster 2 words: b"'s",
Cluster 3 words: b"'s",
Cluster 4 words: b"'s",
Cluster 5 words: b"'s",
Cluster 6 words: b"'s",
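(A note on the b'...' prefixes, separate from the clustering problem below: .encode('utf-8') returns a bytes object in Python 3, and printing bytes adds the b prefix and quotes. Printing the string directly, a minimal tweak to the loop above, gives plain words.)

for i in range(num_clusters):
    print("Cluster %d words:" % i, end='')
    for ind in order_centroids[i, :3]:
        # look up the un-stemmed word as before (.ix is what this notebook uses; newer pandas needs .loc)
        word = vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0]
        print(' %s' % word, end=',')  # print the str directly instead of encoding to bytes
    print()
    print()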
The issue is somewhere with this code:
tfidf_vectorizer = TfidfVectorizer(max_df=0.80, max_features=200000,
min_df=0.20, stop_words='english',
use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))
%time tfidf_matrix = tfidf_vectorizer.fit_transform(abc2) #fit the vectorizer to synopses
print(tfidf_matrix.shape)
Wall time: 4min 33s
(275449, 1)
Notice that the tfidf_matrix shape is (n, 1), where n is the length of abc2. The second dimension should be much higher: it should be the number of terms found within the documents that match your specifications. I'm not sure exactly what's happening here, but that is the problem. It is possible that almost nothing matches your min_df and max_df parameters; you should try lowering min_df and raising max_df.
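As a hedged starting point (the exact values depend on the corpus): with min_df=0.20 and max_df=0.80, a term has to appear in between 20% and 80% of the 275,449 documents, which very few terms are likely to satisfy. A looser configuration might look like this, with the cutoffs below being purely illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer

# looser bounds: keep terms appearing in at least 5 documents and at most 90% of them
tfidf_vectorizer = TfidfVectorizer(max_df=0.90, max_features=200000,
                                   min_df=5, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(abc2)
print(tfidf_matrix.shape)  # the second dimension should now be well above 1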
Did you resolve this? If so, let me know so I can close the issue.