retrofitting Output size smaller than original

Output size smaller than original

Open zharenkov opened this issue 5 years ago • 5 comments

Hi @mfaruqui, I'm passing GloVe's 840B.300d embedding file to retrofit.py. Its size is about 5.5 GB, but the resulting file is only 3.7 GB (for both the wordnet and the paraphrase lexicons). Is this the correct behaviour? If so, can you please explain why the size decreases so significantly?

Thanks!

Jul 02 '19 13:07 zharenkov

hi @zharenkov, hi @mfaruqui, i'm having the same issue: when comparing the original embeddings file to the retrofitted one, around 5% of the lines are lost and i'm wondering why. (retrofitting the same word embedding with different lexicons results in the exact same decreased number of lines for each lexicon) cheers!

Jul 14 '19 12:07 alina-le

Line #44 in the code truncates each float to only 4 digits after the decimal point. If the total number of words in the input and the output is the same, this is fine.
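To illustrate why fewer decimal digits shrink the file, here is a minimal sketch (not taken from retrofit.py) that compares the byte length of one text row at two precisions. The sample values and the six-digit input precision are assumptions chosen for illustration, and `formatted_size` is a hypothetical helper.

```python
def formatted_size(values, fmt):
    """Return the byte length of one space-separated row of formatted floats."""
    return len(" ".join(fmt % v for v in values).encode("utf-8"))


# Hypothetical vector components; the real precision of an input file may differ.
values = [0.123456, -0.654321, 1.234567, -0.000123]

full_size = formatted_size(values, "%.6f")   # six digits after the decimal point
short_size = formatted_size(values, "%.4f")  # four digits, as described above

print(full_size, short_size)
```

Every value loses the same number of characters, so the saving scales with the number of dimensions and the number of words in the file.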

Jul 14 '19 13:07 mfaruqui

thanks for the answer @mfaruqui! figured out it was due to words in the original file being contained in upper as well as in lowercase, while the retrofitted embeddings are all lowercase
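a rough way to check this on your own file is to count how many rows share a lowercased word with an earlier row. this is only a sketch: `count_case_collisions` is a hypothetical helper, and it assumes one word per line followed by whitespace-separated values.

```python
import io


def count_case_collisions(lines):
    """Count rows whose lowercased first token was already seen on an earlier row."""
    seen = set()
    collisions = 0
    for line in lines:
        parts = line.strip().split()
        if not parts:
            continue  # skip blank lines
        key = parts[0].lower()
        if key in seen:
            collisions += 1
        else:
            seen.add(key)
    return collisions


# Tiny in-memory stand-in for an embeddings file (made-up values).
sample = io.StringIO(
    "The 0.1 0.2\nthe 0.3 0.4\ncat 0.5 0.6\nCat 0.7 0.8\ndog 0.9 1.0\n"
)
lost = count_case_collisions(sample)
print(lost)
```

for a real file, pass an open file handle instead of the `io.StringIO` sample and compare the count with the number of missing lines.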

Jul 14 '19 15:07 alina-le

I'm losing around 3% of vectors when retrofitting. I've checked the 56 missing vectors out of 2070 input vectors, and it's not a case of lowercase-uppercase duplicates. Can you please advise on what this could be? Cheers!

Jun 13 '22 14:06 japleengulati

To clarify my issue: the 56 vectors themselves aren't missing, but they're missing dimensions! I input 2070 vectors of 300 dimensions each. In the output I received the same number of vectors, but 56 of them had missing dimensions, so they had 296, 294, etc. dimensions. It does not seem to be a case of formatting gone wrong either; I've checked.
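One way to locate such rows is to scan the file and report every line whose value count differs from the expected dimension. This is a diagnostic sketch only, not part of retrofit.py: `rows_with_wrong_dim` is a hypothetical helper, and it assumes whitespace-separated rows with the word as the first token.

```python
def rows_with_wrong_dim(lines, expected_dim):
    """Return (line_number, word, dimension) for rows that do not have expected_dim values."""
    bad = []
    for line_no, line in enumerate(lines, start=1):
        parts = line.strip().split()
        if not parts:
            continue  # skip blank lines
        dim = len(parts) - 1  # everything after the word is treated as a value
        if dim != expected_dim:
            bad.append((line_no, parts[0], dim))
    return bad


# Made-up rows standing in for an embeddings file; the second row is short.
sample = ["cat 0.1 0.2 0.3", "dog 0.4 0.5", "", "fish 0.7 0.8 0.9"]
bad_rows = rows_with_wrong_dim(sample, expected_dim=3)
print(bad_rows)
```

Running the same check on both the input and the output file would show whether the short rows already exist before retrofitting.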

Jun 13 '22 18:06 japleengulati
