allennlp
Add support for reading pre-trained .bin files in Embedding
For now, Embedding only takes .txt and .hdf5 as pre-trained embedding formats. Would it be possible to add the .bin format, as .bin is the most commonly used format for pre-trained embeddings? Thank you!
It's certainly possible and shouldn't be too difficult. Contributions welcome!
@xdwang0726 can you give a clear example of the type of file you are proposing adding support for? We're not sure what the .bin format is, and it'd be good for us to understand the format before anyone begins implementing support.
I have encountered issues (not in AllenNLP, but in Python in general) where .bin files would not load in Python because they were created on a different OS than the one I was using. Mostly it was Windows vs. Linux, but I even had issues with WSL vs. pure Linux. So any implementation may have to deal with this.
Also, @schmmd, I would assume the .bin he refers to is just pickled data from Python's pickle module. It could be completely wrong, though.
For example, Google's pre-trained word2vec model is distributed as a .bin file (GoogleNews-vectors-negative300.bin).
@xdwang0726 I see that the .bin you pointed to is from https://code.google.com/archive/p/word2vec/downloads. I remember trying word2vec from gensim, in 2016 I think.
I can see that the gensim comment on loading the .bin file indicates that it is in a C-based format used only by word2vec: https://github.com/RaRe-Technologies/gensim/blob/ee3d6fd1e33fe39fc7aa31ebd56bd63b1a2a2ed6/gensim/models/keyedvectors.py#L1841
Can you point to any other examples of such .bin vectors, and/or to somewhere that documents exactly how the .bin file is formatted? Is it pickle, as @gabeorlanski suggested (which the word2vec case above is not)?
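One rough way to tell the two candidates apart is to look at the first few bytes of the file. The sketch below is only a heuristic, and `guess_bin_format` is a hypothetical helper name, not part of AllenNLP or gensim. It assumes that pickle files written with protocol 2 or later begin with the byte `0x80` followed by the protocol number, and that a word2vec C-format binary begins with an ASCII header line of the form `<vocab_size> <vector_size>`.

```python
import re


def guess_bin_format(path):
    """Heuristically classify a .bin file by inspecting its first bytes.

    Hypothetical helper for illustration; returns "pickle",
    "word2vec-binary", or "unknown".
    """
    with open(path, "rb") as f:
        head = f.read(64)

    # Pickle protocols 2-5 start with the PROTO opcode (0x80) followed by
    # the protocol number. Older protocols (0 and 1) are not detected here.
    if len(head) >= 2 and head[0] == 0x80 and 2 <= head[1] <= 5:
        return "pickle"

    # The word2vec C tool writes an ASCII header "<vocab_size> <vector_size>\n"
    # before the binary payload.
    if re.match(rb"\d+ \d+\n", head):
        return "word2vec-binary"

    return "unknown"
```

A file that matches neither pattern is reported as "unknown", which is the case that would still need a description of the format from whoever produced it.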
One can always write a one-time script to convert from word2vec .bin to the desired hdf5 and/or text format.
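A minimal sketch of such a one-time conversion script, using only the standard library, could look like the following. It rests on several assumptions: the input is in the word2vec C binary layout (an ASCII header line `<vocab_size> <vector_size>`, then for each entry the word, a space, and `vector_size` 32-bit floats), the floats are little-endian, and the desired text output is one `word v1 v2 ...` line per token with no header line. The function name `convert_word2vec_bin_to_text` is made up for this example.

```python
import struct


def convert_word2vec_bin_to_text(bin_path, txt_path, byte_order="<"):
    """Stream a word2vec C-format binary file into a plain-text file.

    Assumes little-endian float32 vectors by default (byte_order="<").
    Returns (vocab_size, vector_size) as read from the header.
    """
    with open(bin_path, "rb") as src, open(txt_path, "w", encoding="utf-8") as dst:
        # The header is ASCII: "<vocab_size> <vector_size>\n".
        vocab_size, dim = (int(x) for x in src.readline().split())
        fmt = f"{byte_order}{dim}f"
        vec_bytes = struct.calcsize(fmt)

        for _ in range(vocab_size):
            # Read the word byte by byte up to the separating space,
            # skipping the newline that may follow the previous vector.
            chars = bytearray()
            while True:
                ch = src.read(1)
                if not ch:
                    raise EOFError("unexpected end of file while reading a word")
                if ch == b" ":
                    break
                if ch != b"\n":
                    chars += ch
            word = chars.decode("utf-8", errors="replace")

            data = src.read(vec_bytes)
            if len(data) != vec_bytes:
                raise EOFError("unexpected end of file while reading a vector")
            vector = struct.unpack(fmt, data)

            dst.write(word + " " + " ".join(f"{v:.6f}" for v in vector) + "\n")

    return vocab_size, dim
```

Fixing the byte order explicitly, rather than relying on the machine's native order, is one way to sidestep the cross-platform loading problems mentioned earlier in the thread, though it only helps if the file really was written little-endian.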
I was thinking of picking this up in the coming week, hence the question.
I am also interested in getting this done, but without a clear indication of what exactly the format is, I don't see how we can do it.