w2vec-similarity
Start from a bunch of text documents
I have a bunch of text documents in the form of txt files. How do I use this to cluster them?
Hi, this is old and not well-written code, I wasn't expecting anyone to use it. The following functions are relevant to you:

```python
from train.train import train_from_files
from evaluate.evaluate import evaluate_clusters
```
`train_from_files` takes a list of files as input and trains and stores a word2vec model for every single input file in the directory `models/` (you might have to create it first). You'd also need to pass `separate=True` to `train_from_files`.
`evaluate_clusters` takes as input a list of model files (the ones obtained from the previous step), computes clusters from the loaded models, and stores the output in a file in the directory `models/similarity_matrices/new/` (you'd probably need to create this directory too).
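Both output directories can be created up front; a quick snippet (paths taken from the description above):

```python
from pathlib import Path

# parents=True also creates the intermediate models/ directory
Path("models/similarity_matrices/new").mkdir(parents=True, exist_ok=True)
```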
Apologies for the convoluted and hard-to-use code.
Also note that since an individual model is trained on every single file, you'd need fairly long documents to learn good representations. In the original task, these were novels from Project Gutenberg (40,000-400,000 words per document).
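Putting the two steps together, a rough sketch of the workflow might look like the following. Note this is untested: the `data_dir` layout, the `*.model` filename pattern, and the exact signatures of `train_from_files` and `evaluate_clusters` are assumptions based on the description above, so adjust them against the actual code.

```python
from pathlib import Path


def list_documents(data_dir="data"):
    """Collect the .txt files to train on (one word2vec model per file)."""
    return sorted(str(p) for p in Path(data_dir).glob("*.txt"))


def run_pipeline(data_dir="data"):
    # Imports kept local so the sketch is readable without the repo installed.
    from train.train import train_from_files
    from evaluate.evaluate import evaluate_clusters

    files = list_documents(data_dir)

    # Step 1: train one model per input file (separate=True is required).
    # The models/ directory must exist beforehand.
    Path("models").mkdir(exist_ok=True)
    train_from_files(files, separate=True)

    # Step 2: cluster the per-document models. Output goes to
    # models/similarity_matrices/new/, which must also exist.
    # The "*.model" pattern below is a guess at the saved filenames.
    Path("models/similarity_matrices/new").mkdir(parents=True, exist_ok=True)
    model_files = sorted(str(p) for p in Path("models").glob("*.model"))
    evaluate_clusters(model_files)


if __name__ == "__main__":
    run_pipeline()
```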
Ok, sorry to bother you, but could you please refactor it and provide some sample code to train and evaluate?
Sorry, I won't be able to get to it this month, at least. You're welcome to try your hand at it and I'd be glad to help you out in case you face issues. I'll also be sure to update this issue once I refactor the code.