w2vec-similarity

Start from a bunch of text documents

Open somnathrakshit opened this issue 7 years ago • 4 comments

I have a bunch of text documents in the form of txt files. How do I use this to cluster them?

somnathrakshit avatar Jun 08 '17 13:06 somnathrakshit

Hi, this is old and not well-written code; I wasn't expecting anyone to use it. The following functions are relevant to you:

from train.train import train_from_files
from evaluate.evaluate import evaluate_clusters

train_from_files takes a list of files as input, trains a word2vec model for every single input file, and stores the models in the directory models/ (you might have to create it first). Also, you'd need to pass separate=True to train_from_files.

evaluate_clusters takes as input a list of model files (the ones obtained from the previous step), computes clusters from the loaded models, and stores the output in a file in the directory models/similarity_matrices/new/ (you'd probably need to create this directory too).

Apologies for the convoluted and hard-to-use code.
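For readers following along, the two steps above could be wired together roughly as follows. This is only a sketch under stated assumptions: prepare_inputs and run_pipeline are hypothetical helper names, not part of the repository; the code assumes Python 3 and that it runs from the repository root; and it guesses that every regular file directly inside models/ is a trained model, which may not hold if the models are saved with auxiliary files.

```python
from pathlib import Path

MODELS_DIR = Path("models")
OUTPUT_DIR = MODELS_DIR / "similarity_matrices" / "new"


def prepare_inputs(doc_dir, base_dir="."):
    """Create the output directories and return the sorted .txt paths in doc_dir."""
    base = Path(base_dir)
    # parents=True also creates models/ itself if it is missing.
    (base / OUTPUT_DIR).mkdir(parents=True, exist_ok=True)
    return sorted(str(p) for p in Path(doc_dir).glob("*.txt"))


def run_pipeline(doc_dir):
    """Hypothetical wrapper: train one model per file, then cluster the models."""
    # Imported lazily so prepare_inputs works without the repository on the path.
    from train.train import train_from_files
    from evaluate.evaluate import evaluate_clusters

    files = prepare_inputs(doc_dir)
    train_from_files(files, separate=True)  # one word2vec model per input file

    # Assumption: every regular file directly inside models/ is a trained model.
    model_files = sorted(str(p) for p in MODELS_DIR.iterdir() if p.is_file())
    evaluate_clusters(model_files)
```

Calling run_pipeline("path/to/txt/files") from the repository root should then leave the clustering output under models/similarity_matrices/new/, provided the assumption about the model file names holds.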

jayantj avatar Jun 15 '17 19:06 jayantj

Also note that since an individual model is trained on every single file, you'd need long documents to learn good representations. In the original task, these were novels from Project Gutenberg (40,000-400,000 words per document).
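A quick way to see whether your own documents are in that range is to count words per file before training. The sketch below is not part of the repository: count_words and short_documents are hypothetical helpers, words are approximated by whitespace splitting, and the 40,000 threshold is just the lower end of the range mentioned above, not a hard requirement.

```python
MIN_WORDS = 40_000  # lower end of the range quoted above; a rough guide only


def count_words(path):
    """Approximate the word count of a text file by splitting on whitespace."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        return sum(len(line.split()) for line in handle)


def short_documents(paths, min_words=MIN_WORDS):
    """Return {path: word_count} for files with fewer than min_words words."""
    counts = {str(p): count_words(p) for p in paths}
    return {p: n for p, n in counts.items() if n < min_words}
```

Files reported by short_documents are candidates for merging with related documents or for leaving out, since a per-file model trained on them is unlikely to be reliable.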

jayantj avatar Jun 15 '17 19:06 jayantj

OK, sorry to bother you, but could you please refactor it and provide some sample code to train and evaluate?

somnathrakshit avatar Jun 16 '17 06:06 somnathrakshit

Sorry, I won't be able to get to it this month, at least. You're welcome to try your hand at it and I'd be glad to help you out in case you face issues. I'll also be sure to update this issue once I refactor the code.

jayantj avatar Jun 16 '17 09:06 jayantj