retrofitting
Output size smaller than original
Hi @mfaruqui, I'm passing GloVe's 840B.300d embedding file to retrofit.py. Its size is about 5.5 GB, but the resulting file is only 3.7 GB (with both the WordNet and the paraphrase lexicon). Is this the correct behaviour? If so, can you please explain why the size decreases so significantly?
Thanks!
Hi @zharenkov, hi @mfaruqui, I'm having the same issue: when comparing the original embeddings file to the retrofitted one, around 5% of the lines are lost, and I'm wondering why. (Retrofitting the same word embeddings with different lexicons results in exactly the same reduced number of lines for each lexicon.) Cheers!
Line #44 in the code limits each float to only 4 digits after the decimal point, which makes the output file smaller. If the total number of words in the input and output is the same, this is fine.
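To illustrate why fewer decimal digits shrink the file without dropping any words, here is a minimal sketch. The `serialize` helper and the sample vector are hypothetical and written for this example only; they are not taken from retrofit.py, and the `%.4f`-style formatting is an assumption based on the description above.

```python
def serialize(word, vector, digits=None):
    """Serialize one embedding line (hypothetical helper, not from retrofit.py).

    digits=None keeps Python's full repr precision; otherwise each value
    is formatted with a fixed number of digits after the decimal point.
    """
    if digits is None:
        nums = [repr(v) for v in vector]
    else:
        nums = ["%.*f" % (digits, v) for v in vector]
    return word + " " + " ".join(nums)


# Made-up sample values, for illustration only.
vec = [0.123456, -0.654321, 0.000012]

full = serialize("cat", vec)             # full-precision line
short = serialize("cat", vec, digits=4)  # "cat 0.1235 -0.6543 0.0000"

# Same word and same number of dimensions, but fewer characters per line.
print(len(full), len(short))
```

If the word counts of the input and output files match, a size difference of this kind would be expected from the formatting alone.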
Thanks for the answer, @mfaruqui! I figured out it was because the original file contains words in both upper and lower case, while the retrofitted embeddings are all lowercase.
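For anyone who wants to check whether case folding accounts for their missing lines, here is a rough sketch. It assumes a whitespace-separated text format with the word as the first token on each line, and that the output words are lowercased as described above; `read_words` and `count_case_collisions` are hypothetical helper names, not part of retrofit.py.

```python
def read_words(path):
    """Yield the first token of each non-empty line (assumed file format)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts:
                yield parts[0]


def count_case_collisions(words):
    """Return how many distinct words disappear when everything is lowercased."""
    unique_original = set(words)
    unique_lowered = {w.lower() for w in unique_original}
    return len(unique_original) - len(unique_lowered)


# "The", "the" and "THE" collapse into one entry, so two entries are lost.
print(count_case_collisions(["The", "the", "THE", "cat"]))  # 2
```

On a real file, `count_case_collisions(read_words("vectors.txt"))` (the path is a placeholder) can be compared against the difference in line counts between the input and the output.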
I'm losing around 3% of vectors when retrofitting. I've checked the 56 missing vectors out of 2070 input vectors, and it's not a case of lowercase/uppercase duplicates. Can you please advise on what this could be? Cheers!
To clarify my issue: the 56 vectors themselves aren't missing, but they're missing dimensions! I input 2070 vectors of 300 dimensions each. The output contains the same number of vectors, but 56 of them have fewer dimensions, e.g. 296, 294, etc. It does not seem to be a case of formatting gone wrong either; I've checked.
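A quick way to list the affected lines is sketched below. It assumes each line is a word followed by whitespace-separated values; `find_wrong_dims` is a hypothetical helper for diagnosis only, and it does not explain why the dimensions go missing.

```python
def find_wrong_dims(lines, expected_dim):
    """Return (line_number, word, dim) for lines whose dimension count differs."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        dim = len(parts) - 1  # everything after the word counts as a value
        if dim != expected_dim:
            bad.append((lineno, parts[0], dim))
    return bad


# Toy input with 3 expected dimensions; the second line is one value short.
sample = ["a 0.1 0.2 0.3", "b 0.1 0.2", "", "c 0.1 0.2 0.3"]
print(find_wrong_dims(sample, expected_dim=3))  # [(2, 'b', 2)]
```

Running it on both the input and the output file (for example by passing an open file object as `lines` with `expected_dim=300`) would show whether the short vectors are already present before retrofitting.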