document_cluster Printing Clusters (Top terms & titles)

I've followed all the steps down to the final one, where you print the top terms per cluster together with the film titles. I'm using a slightly different dataset (blog titles and blog post content), but in essence my data is the same as yours, except that it is already in a dataframe, so where you use 'synopses', I use df.Content. The one step I couldn't do was the one where you grouped the rank by cluster, as this obviously doesn't apply to me. I want ten clusters from my data.

Here, you create a dictionary:

films = {'title': titles, 'rank': ranks, 'synopsis': synopses, 'cluster': clusters, 'genre': genres}

frame = pd.DataFrame(films, index=[clusters], columns=['rank', 'title', 'cluster', 'genre'])

But as I already have a dataframe, I reindexed it using clusters. The problem is that only the first ten blog post titles are being used, as this screenshot shows:

As this is my first attempt at kMeans (although I've been experimenting with my data for three weeks) I'm not yet clever enough to work out what's going wrong. Any ideas? Thanks in advance!

Jan 03 '17 16:01 s2hewitt

Actually, I HAVE solved the problem, but now I have another. I did this: ...but when I run the final step (printing the top terms and titles in each cluster), I get the following error message:

I'm at a loss again.... Thanks!

Jan 03 '17 21:01 s2hewitt

@s2hewitt it looks like Title in your frame object is not an array but a string. In the frame I referenced, Title is an array of film titles associated with the cluster. It looks like you have one row per title, each with an associated cluster.

If you want to get all the titles for a given cluster (assuming the above is true) you can do something like:

import pandas as pd

data = [
        {'Title': 'film 1', 'cluster': 0},
        {'Title': 'film 2', 'cluster': 0},
        {'Title': 'film 3', 'cluster': 1},
        {'Title': 'film 4', 'cluster': 1},
        {'Title': 'film 5', 'cluster': 1},
        {'Title': 'film 6', 'cluster': 2},
        {'Title': 'film 7', 'cluster': 2},
        {'Title': 'film 8', 'cluster': 2}
    ]

frame = pd.DataFrame(data)

# get unique list of clusters (sorted so the output order is stable)
clusters = sorted(set(frame.cluster))

# iterate over list of clusters
for clust in clusters:

    # subset frame based on cluster then grab those titles
    cluster_titles = ', '.join(frame[frame['cluster'] == clust].Title.tolist())

    print('Cluster {0} Titles: {1}\n'.format(clust, cluster_titles))
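The same grouping can also be sketched with groupby, which avoids filtering the frame once per cluster. This is only a minimal sketch on a small made-up frame; the column names Title and cluster are assumed to match yours.

```python
import pandas as pd

# small made-up frame: one row per title, with its cluster label
frame = pd.DataFrame({
    'Title': ['film 1', 'film 2', 'film 3', 'film 4'],
    'cluster': [0, 0, 1, 1],
})

# collect the titles of each cluster into a list, keyed by cluster label
titles_by_cluster = frame.groupby('cluster')['Title'].apply(list).to_dict()

for clust, titles in titles_by_cluster.items():
    print('Cluster {0} Titles: {1}\n'.format(clust, ', '.join(titles)))
```

With nearly 4,000 rows this should still be a single pass over the data, so the size of the frame is unlikely to matter here.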

Let me know if that helps!

Jan 03 '17 22:01 brandomr

Thanks. It'll take me a good while to work this out.
My dataframe has nearly 4,000 blog post titles and associated content. I'm trying this out on this sample - my final .csv file is much, much bigger. I think I need to go back and see if I can replicate how you created and converted the dictionary, although I'm guessing that you worked from lists, which isn't practical for the file sizes I'm eventually going to be working with.

Jan 03 '17 22:01 s2hewitt

@s2hewitt if you're able to post your notebook and some sample data, I could take a look. If you're not getting out-of-memory errors, then the data size isn't a problem; you might just need some fancy footwork to convert the dataframe into a format that gives you your desired output.
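In the meantime, here is a rough sketch of one possible conversion. I'm guessing at your setup: df below is a made-up stand-in with Title and Content columns, and clusters is a hypothetical list of labels with one entry per row, in row order. It also shows what reindexing by the cluster labels does, which may be why you only saw the first ten titles: reindex looks rows up by index label, so labels 0-9 keep pulling rows 0-9.

```python
import pandas as pd

# made-up stand-in for the blog dataframe (assumed columns: Title, Content)
df = pd.DataFrame({
    'Title': ['post {0}'.format(i) for i in range(6)],
    'Content': ['text {0}'.format(i) for i in range(6)],
})

# hypothetical cluster labels, one per row of df, in the same order
clusters = [0, 1, 2, 0, 1, 2]

# pitfall: reindex selects rows whose index label equals each cluster id,
# so only rows 0, 1 and 2 ever appear (repeated)
reindexed = df.reindex(clusters)
print(reindexed['Title'].tolist())

# alternative: keep every row and attach the label as a new column
frame = df.assign(cluster=clusters)

for clust in sorted(frame['cluster'].unique()):
    titles = frame.loc[frame['cluster'] == clust, 'Title'].tolist()
    print('Cluster {0} Titles: {1}'.format(clust, ', '.join(titles)))
```

Because the labels are attached as a column, this works directly on the dataframe you already have, without going back through lists or a dictionary.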

Jan 04 '17 03:01 brandomr

Thanks! Could you follow me on Twitter so I can DM you? This is for my PhD, and while I'm very willing to share solutions with anyone else who may encounter the same issues, I'm a bit protective of my data!

Jan 04 '17 12:01 s2hewitt
