
Generation of paragraph vector files

Open varnitdixit opened this issue 2 years ago • 9 comments

Hi,

Can you please help me understand how you are generating the first, second, and third paragraph vector numpy files mentioned in the 'preprocessing.py' file:

par_vec_trained_400.pkl.trainables.syn1neg.npy
par_vec_trained_400.pkl.docvecs.vectors_docs.npy
par_vec_trained_400.pkl.wv.vectors.npy


varnitdixit avatar Feb 06 '23 10:02 varnitdixit

Hi! These can be generated with the notebook here: https://github.com/mitmedialab/sherlock-project/blob/master/notebooks/01-data-preprocessing.ipynb.

Please let me know if this is clear and works for you!

Madelon

madelonhulsebos avatar Feb 06 '23 11:02 madelonhulsebos

Hi, thanks for the quick response. I saw that we can download these files from: https://github.com/mitmedialab/sherlock-project/blob/master/notebooks/01-data-preprocessing.ipynb.

But my actual question is how you created these files in the first place. How can I re-create them instead of downloading them?

I hope my question makes sense to you.

Varnit

varnitdixit avatar Feb 06 '23 11:02 varnitdixit

Hi Varnit,

These files can be obtained by extracting paragraph vectors again with the code in this module: https://github.com/mitmedialab/sherlock-project/blob/master/sherlock/features/paragraph_vectors.py. The process is displayed here: https://github.com/mitmedialab/sherlock-project/blob/master/notebooks/03-retrain-paragraph-vector-features.ipynb.

Does that address your question?

Regards, Madelon

madelonhulsebos avatar Feb 06 '23 12:02 madelonhulsebos

Hi Madelon,

Thanks again for the response. I looked into the code you mentioned, and it seems it only generates the pkl file, i.e. par_vec_trained_400.pkl. I actually want to know how you are creating these specific npy files:

par_vec_trained_400.pkl.trainables.syn1neg.npy
par_vec_trained_400.pkl.docvecs.vectors_docs.npy
par_vec_trained_400.pkl.wv.vectors.npy

I'm not able to find the code that writes/creates these three files. It would be a great help if you could point me to that piece of code so that I can re-create them for a different dataset.

Regards, Varnit

varnitdixit avatar Feb 06 '23 16:02 varnitdixit

Hi Varnit,

These additional files are generated automatically by gensim's .save() method when the model is rather large. They are then also expected to exist when the model is loaded for inference via .load() (both methods are called in the sherlock/features/paragraph_vectors.py module). The files will be created automatically for your newly trained model (if it exceeds a certain size) once you extract paragraph vectors for your own training dataset. If they are not recreated for your model, then I assume you do not need them. Let me know if you run into issues.

Regards, Madelon

madelonhulsebos avatar Feb 08 '23 09:02 madelonhulsebos

Hi Madelon, congratulations on the excellent work. I have similar questions to those asked by varnitdixit, specifically:

  • Since I am working with custom data, it is important to retrain the paragraph vector model, correct?
  • If so, is the right time to do this before the feature extraction step?

I followed this work pipeline:

  1. I create the parquet files with my data.
  2. I retrain the paragraph vector model on my data (using the code in 03-retrain-paragraph-vector-features.ipynb).
  3. I run the feature extraction process (using the code in 01-data-preprocessing.ipynb).
  4. I retrain the model (02-1-train-and-test-sherlock.ipynb).

I do not understand in which of these steps the files par_vec_trained_400.pkl.trainables.syn1neg.npy, par_vec_trained_400.pkl.docvecs.vectors_docs.npy, and par_vec_trained_400.pkl.wv.vectors.npy should be created, and what they are used for.

Best regards, Giacomo

GiacomoPracucci avatar Feb 05 '24 13:02 GiacomoPracucci

Hi Giacomo,

Thank you!

  • I recommend building a new paragraph vector model, but you can first check whether the existing one works for your dataset.
  • Indeed, this should be done before the feature extraction process.

Your pipeline looks OK. Were you not able to extract the features? What error message did you get? You don't need to create those files yourself; gensim will create them automatically if needed.

Let me know if you have any other questions!

Best, Madelon

madelonhulsebos avatar Feb 07 '24 18:02 madelonhulsebos

Thank you for your response, Madelon. Forgive me if I'm not being clear; it's probably because I don't understand what those files are for.

Actually, I don't get any errors: I can create the datasets, train the paragraph vector model on my data, extract the features, and train the model.

My doubt about those files comes from the fact that when I retrain the paragraph vector model, a new, updated par_vec_trained_400.pkl file is created (I can see it from the file's last-modified date), while the other files I mentioned are not updated and remain the same as the ones downloaded from the links given in sherlock/features/preprocessing.py.

So probably the correct question is: should these files also be updated when I work with custom data, or are the originals downloaded from the indicated Google Drive links fine?

You wrote that gensim creates them automatically "if needed": what is the requirement for creating new ones?

Thank you again, Giacomo

GiacomoPracucci avatar Feb 07 '24 20:02 GiacomoPracucci

Hi Giacomo,

No problem, great to hear that you got it working on your dataset!

Gensim only creates the auxiliary files if the model's arrays exceed a certain size (I am not sure what the exact cutoff is). If the auxiliary files are not overwritten, then I assume your model is small enough to be stored without them (gensim uses different storage methods for smaller versus larger models).
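For what it's worth, the cutoff appears to be the sep_limit parameter of gensim's underlying save() method, which (as far as I can tell) defaults to 10 MB per array; do verify this against the gensim version you use. A quick back-of-the-envelope check for a 400-dimensional model:

```python
SEP_LIMIT = 10 * 1024 ** 2  # gensim's default sep_limit (10 MB per array); check your version
BYTES_PER_FLOAT32 = 4
dims = 400  # vector_size of the par_vec_trained_400 model

# the document-vector matrix holds n_documents x dims float32 values
small_corpus = 1_000 * dims * BYTES_PER_FLOAT32    # ~1.6 MB
large_corpus = 100_000 * dims * BYTES_PER_FLOAT32  # ~160 MB

print(small_corpus > SEP_LIMIT)  # False -> matrix stays inside the .pkl, no sidecar
print(large_corpus > SEP_LIMIT)  # True  -> matrix goes to a separate .npy file
```

So whether the sidecar files get (re)written depends on how many documents (extracted column values) your corpus contains, not on anything you do explicitly.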

So, there is no need to worry about these files not being modified. The original auxiliary files should not be used for feature extraction on your custom data. You can consider (re)moving those auxiliary files, just to be sure gensim doesn't load them (I am not sure how gensim handles this internally).

Regards, Madelon

madelonhulsebos avatar Feb 08 '24 18:02 madelonhulsebos