document_cluster
Printing Clusters (Top terms & titles)
I've followed all the steps down to the final one where you print the top terms per cluster, together with the film titles. I'm using a slightly different dataset (blog titles and blog post content) but in essence my data is the same as yours, although my data is already in a dataframe, so where you call on 'synopses', I call df.Content. The one step I couldn't do was the one where you grouped the rank by clusters as obviously this doesn't apply to me. I want ten clusters from my data.
Here, you create a dictionary:
films = { 'title': titles, 'rank': ranks, 'synopsis': synopses, 'cluster': clusters, 'genre': genres }
frame = pd.DataFrame(films, index = [clusters] , columns = ['rank', 'title', 'cluster', 'genre'])
But as I already have a dataframe, I reindexed it using clusters. The problem is that only the first ten blog post titles are being used, as this screenshot shows:

As this is my first attempt at kMeans (although I've been experimenting with my data for three weeks) I'm not yet clever enough to work out what's going wrong. Any ideas? Thanks in advance!
Actually, I HAVE solved the problem, but now I have another.
I did this:
...but when I run the final step (printing the top terms and titles in each cluster), I get the following error message:

I'm at a loss again.... Thanks!
@s2hewitt it looks like Title in your frame object is not an array but a string. In the frame I referenced Title is an array of film titles associated with the cluster. It looks like you have one row per title and an associated cluster.
If you want to get all the titles for a given cluster (assuming the above is true) you can do something like:
import pandas as pd
data = [
{'Title': 'film 1', 'cluster': 0},
{'Title': 'film 2', 'cluster': 0},
{'Title': 'film 3', 'cluster': 1},
{'Title': 'film 4', 'cluster': 1},
{'Title': 'film 5', 'cluster': 1},
{'Title': 'film 6', 'cluster': 2},
{'Title': 'film 7', 'cluster': 2},
{'Title': 'film 8', 'cluster': 2}
]
frame = pd.DataFrame(data)
# get unique list of clusters
clusters = list(set(frame.cluster))
# iterate over list of clusters
for clust in clusters:
    # subset frame based on cluster then grab those titles
    cluster_titles = ', '.join(frame[frame['cluster'] == clust].Title.tolist())
    print('Cluster {0} Titles: {1}\n'.format(clust, cluster_titles))
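The same grouping can also be sketched with `groupby`, which collects the titles in one pass instead of filtering the frame once per cluster. This assumes the same one-row-per-title layout with `Title` and `cluster` columns; the sample rows are made up for illustration.

```python
import pandas as pd

# made-up sample rows: one row per title with its cluster label
frame = pd.DataFrame({
    'Title': ['film 1', 'film 2', 'film 3', 'film 4', 'film 5'],
    'cluster': [0, 0, 1, 1, 2],
})

# build a Series indexed by cluster label, each value a list of titles
titles_by_cluster = frame.groupby('cluster')['Title'].apply(list)

for clust, cluster_titles in titles_by_cluster.items():
    print('Cluster {0} Titles: {1}\n'.format(clust, ', '.join(cluster_titles)))
```

On a few thousand rows either approach should be fine; `groupby` mainly saves repeating the boolean filter for every cluster.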
Let me know if that helps!
Thanks. It'll take me a good while to work this out.
My dataframe has nearly 4,000 blog post titles and associated content. I'm trying this out on this sample - my final .csv file is much, much bigger.
I think I need to go back and see if I can replicate how you created and converted the dictionary, although I'm guessing that you worked from lists which isn't practical for the file sizes I'm eventually going to be working with.
@s2hewitt if you're able to post your notebook and some sample data I could take a look. If you're not getting out-of-memory errors then the data size isn't a problem; you might just need some fancy footwork to convert the dataframe into a format that gives you your desired output.
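In the meantime, here is a rough sketch of one way to attach the cluster labels to a dataframe you already have. I'm guessing at your setup here: `df`, its `Title` and `Content` columns, and the `clusters` list (one label per row, in row order) are stand-ins, and the sample values are invented. It also shows why `reindex` with the labels could end up reusing only the first few rows, which may or may not be what happened in your notebook.

```python
import pandas as pd

# invented stand-in for the blog dataframe
df = pd.DataFrame({
    'Title': ['post a', 'post b', 'post c', 'post d'],
    'Content': ['text a', 'text b', 'text c', 'text d'],
})
# invented cluster labels, one per row, in the same order as df
clusters = [1, 0, 1, 0]

# reindex looks rows up BY LABEL, so only the rows labelled 0 and 1 are reused
reindexed = df.reindex(clusters)

# adding the labels as a column keeps every row; overwriting the index
# afterwards mirrors the index=[clusters] step from the tutorial
frame = df.copy()
frame['cluster'] = clusters
frame.index = clusters

# all titles for cluster 1
print(frame.loc[1, 'Title'].tolist())
```

With ten clusters, `reindex` would only ever pull from the rows labelled 0 to 9, which would match seeing just the first ten titles.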
Thanks! Could you follow me on Twitter so I can DM you? This is for my PhD, and while I'm very willing to share solutions with anyone else who may encounter the same issues, I'm a bit protective of my data!