SPINE

Nonzero elements are always in the same dimensions

Open Narabzad opened this issue 4 years ago • 6 comments

I applied SPINE on top of some embeddings and obtained sparse embeddings, but the nonzero elements are almost always in exactly the same dimensions across samples. I attached an image to this issue that might help illustrate the problem.

Do you have any idea why this is happening? [image attached]
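For concreteness, a rough way to quantify this, assuming the SPINE outputs are stored as an (n_samples, n_dims) NumPy array (the file path and threshold below are just placeholders):

```python
# Rough diagnostic: how concentrated are the nonzero entries across samples?
import numpy as np

Z = np.load("spine_embeddings.npy")      # placeholder path to the sparse outputs
active = Z > 1e-6                        # treat tiny values as zero
per_dim_rate = active.mean(axis=0)       # fraction of samples using each dimension

print("dims active in >99% of samples:", int((per_dim_rate > 0.99).sum()))
print("dims active in <1% of samples: ", int((per_dim_rate < 0.01).sum()))
print("avg nonzeros per sample:       ", float(active.sum(axis=1).mean()))
```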

Narabzad · Oct 13 '20

Hi, could you please provide some more information:

1) What are 'some embeddings'? Are they standard word2vec or GloVe embeddings, or are they from some other source?
2) What are the words in the image you have attached?
3) On what data did you train the SPINE model? E.g., did you use a SPINE model trained on word2vec?

harsh19 · Oct 19 '20

Actually, they are not word embeddings. They are RepBERT representations of some MS MARCO documents; in other words, fine-tuned BERT representations of passages. No GloVe or word2vec model was used. After reading the paper, I thought I could apply the SPINE method to any embedding space to make the embeddings sparser and more interpretable. What do you think about this? Is it not possible?

Narabzad · Oct 19 '20

Got it, thanks for clarifying. If I understood correctly, you train the model on RepBERT embeddings of MS MARCO documents. Did you play around with the hyperparameters? The hyperparameter values in this codebase are suggested settings for GloVe and word2vec embeddings, and would probably need some adjustment when you apply SPINE to other embeddings.

harsh19 · Oct 20 '20

Yes, exactly. I played around a little with the number of dimensions and the sparsity rate, but nothing changed: I still get all the nonzero values in the same specific dimensions. Which hyperparameters do you suggest I experiment with more?
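For reference, one rough way to check whether two runs with different settings really end up using the same dimensions (file names and the activity threshold are placeholders):

```python
# Compare which dimensions are "commonly active" in two SPINE runs.
import numpy as np

def active_dims(path, thresh=1e-6):
    Z = np.load(path)                            # (n_docs, n_dims) sparse outputs
    return set(np.flatnonzero((Z > thresh).mean(axis=0) > 0.5))

a = active_dims("spine_run_a.npy")               # placeholder file names
b = active_dims("spine_run_b.npy")
jaccard = len(a & b) / max(len(a | b), 1)
print(f"run A uses {len(a)} dims, run B uses {len(b)} dims, overlap {jaccard:.2f}")
```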

Narabzad · Oct 21 '20

Hi @Narabzad, thanks for your interest. If your starting representations are very "similar" to each other, it might happen that the resulting embeddings are also high/low in the same dimensions (in the extreme case, if all the starting representations were exactly the same for every document, you would end up in exactly this situation).
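As a quick sanity check of this, you could measure how similar the input RepBERT vectors are to each other before SPINE; a minimal sketch, assuming they are stored as an (n_docs, dim) NumPy array (the path is a placeholder):

```python
# Pairwise cosine similarity of the *input* representations.
import numpy as np

X = np.load("repbert_vectors.npy")                  # placeholder path
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # L2-normalize rows
sim = Xn @ Xn.T                                     # pairwise cosine similarities
off_diag = sim[~np.eye(len(sim), dtype=bool)]
print(f"mean pairwise cosine similarity: {off_diag.mean():.3f}")
print(f"min / max: {off_diag.min():.3f} / {off_diag.max():.3f}")
```

If the mean similarity comes out very high, that would support this explanation.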

One simple thing to try here might be to use a very high coefficient on the reconstruction loss, so that even small reconstruction errors are penalized. If your resulting embeddings all look similar (like in the picture you shared), you wouldn't be able to reconstruct the inputs well, so penalizing the reconstruction loss heavily might prevent this behaviour. Try this alongside a high coefficient for the PSL loss (so that the values are pushed towards 0 and 1).
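For intuition, here is a rough PyTorch sketch of how those coefficients enter the training objective, based on my reading of the SPINE paper (reconstruction loss plus average sparsity loss and partial sparsity loss); the layer sizes, the sparsity target rho, and the coefficient values are placeholders, and the actual codebase may differ in details:

```python
# Toy SPINE-style autoencoder with a weighted loss; values are illustrative only.
import torch
import torch.nn as nn

class TinySpine(nn.Module):
    def __init__(self, in_dim=768, hid_dim=1000):
        super().__init__()
        self.enc = nn.Linear(in_dim, hid_dim)
        self.dec = nn.Linear(hid_dim, in_dim)

    def forward(self, x):
        z = torch.clamp(self.enc(x), min=0.0, max=1.0)   # capped ReLU keeps z in [0, 1]
        return z, self.dec(z)

def spine_loss(x, z, x_hat, rho=0.15, w_rl=10.0, w_asl=1.0, w_psl=1.0):
    rl = ((x_hat - x) ** 2).sum(dim=1).mean()                     # reconstruction loss
    asl = torch.clamp(z.mean(dim=0) - rho, min=0.0).pow(2).sum()  # average sparsity loss
    psl = (z * (1.0 - z)).sum(dim=1).mean()                       # pushes values towards 0 or 1
    return w_rl * rl + w_asl * asl + w_psl * psl

# Increasing w_rl (and w_psl) relative to w_asl is the adjustment suggested above.
model = TinySpine()
x = torch.randn(32, 768)                                          # stand-in batch
z, x_hat = model(x)
loss = spine_loss(x, z, x_hat, w_rl=50.0, w_psl=5.0)
loss.backward()
```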

danishpruthi · Oct 21 '20

Not all of the documents are very similar, but each group of documents may be highly similar, since they were retrieved for the same query.

I will try increasing the coefficient for the PSL loss and keep you posted. I really appreciate your time and effort.

Narabzad · Oct 23 '20