Add support for reading pre-trained .bin files to Embedding

Open xdwang0726 opened this issue 4 years ago • 6 comments

For now, Embedding only takes .txt and .hdf5 files as pre-trained embedding formats. Would it be possible to add support for .bin, since .bin is the most commonly used format for pre-trained embeddings? Thank you!

xdwang0726 avatar May 27 '20 18:05 xdwang0726

It's certainly possible and shouldn't be too difficult. Contributions welcome!

epwalsh avatar May 27 '20 19:05 epwalsh

@xdwang0726 can you give a clear example of the type of file you are proposing adding support for? We're not sure what .bin format is, and it'd be good for us to understand the format before anyone begins implementation on adding support.

schmmd avatar May 29 '20 22:05 schmmd

I have encountered issues (not in AllenNLP, but in Python in general) where .bin files would not load because they were created on a different OS than the one I was using. Mostly it was Windows vs. Linux, but I even had issues with WSL vs. pure Linux. So any implementation may have to deal with this.

Also, @schmmd, I would assume the .bin he refers to is just pickled data from Python's pickle module. I could be completely wrong, though.

gabeorlanski avatar Feb 15 '21 21:02 gabeorlanski

For example, Google's pretrained word2vec model is distributed as a .bin file (GoogleNews-vectors-negative300.bin).

xdwang0726 avatar Feb 21 '21 14:02 xdwang0726

@xdwang0726 I see that the .bin you pointed to is from https://code.google.com/archive/p/word2vec/downloads. I remember trying word2vec from gensim in 2016, I think.

I can see that the gensim comment on loading the .bin file indicates that it is in a C-based format used only by word2vec:

https://github.com/RaRe-Technologies/gensim/blob/ee3d6fd1e33fe39fc7aa31ebd56bd63b1a2a2ed6/gensim/models/keyedvectors.py#L1841

Can you point to any other examples of such .bin vectors, or to somewhere that documents exactly how the .bin file is formatted? Is it pickle, as @gabeorlanski suggested (which is not the case for the word2vec file above)?
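One rough way to narrow down what a given .bin file is would be to look at its first bytes. The sketch below is only a heuristic, and the function name `sniff_embedding_format` is made up for illustration. It assumes that pickle protocol 2 or later starts with byte 0x80, that HDF5 files start with their fixed 8-byte signature, and that a word2vec C-format file starts with an ASCII header line of the form `<vocab_size> <dim>`. That header is shared by the word2vec text format, so a match does not by itself prove the payload is binary.

```python
import re


def sniff_embedding_format(path):
    """Guess an embedding file's format from its first bytes (heuristic only)."""
    with open(path, "rb") as f:
        head = f.read(64)
    # Pickle protocol 2 and later begins with the PROTO opcode, byte 0x80.
    if head[:1] == b"\x80":
        return "pickle"
    # HDF5 files begin with a fixed 8-byte signature.
    if head.startswith(b"\x89HDF\r\n\x1a\n"):
        return "hdf5"
    # word2vec C format (binary or text) begins with "<vocab_size> <dim>\n".
    if re.match(rb"\d+ \d+\n", head):
        return "word2vec-header"
    return "unknown"
```

Anything reported as "unknown" would still need to be inspected by hand.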

One can always write a one-time script to convert from word2vec .bin to the desired hdf5 and/or text format.
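A minimal sketch of such a conversion script, using only the standard library. It assumes the layout written by the original word2vec C tool: an ASCII header line `<vocab_size> <dim>`, then for each entry the word's bytes terminated by a space, followed by `dim` 32-bit floats, optionally followed by a newline. It also assumes the floats are little-endian, which holds for files written on x86 machines but is not guaranteed by the format. The function names are hypothetical, and the output is plain space-separated text with one word per line.

```python
import struct


def read_word2vec_bin(path):
    """Yield (word, vector) pairs from a word2vec C-format binary file."""
    with open(path, "rb") as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        for _ in range(vocab_size):
            chars = bytearray()
            while True:
                ch = f.read(1)
                if ch == b"":
                    raise EOFError("file ended inside a word")
                if ch == b" ":
                    break
                if ch != b"\n":  # skip the newline ending the previous record
                    chars += ch
            raw = f.read(4 * dim)
            if len(raw) != 4 * dim:
                raise EOFError("file ended inside a vector")
            # "<" forces little-endian float32 (an assumption, see above).
            vector = struct.unpack("<%df" % dim, raw)
            yield chars.decode("utf-8", errors="replace"), vector


def word2vec_bin_to_text(bin_path, txt_path):
    """Write one 'word v1 v2 ...' line per entry; return the entry count."""
    count = 0
    with open(txt_path, "w", encoding="utf-8") as out:
        for word, vector in read_word2vec_bin(bin_path):
            out.write(word + " " + " ".join(repr(v) for v in vector) + "\n")
            count += 1
    return count
```

Calling `word2vec_bin_to_text("GoogleNews-vectors-negative300.bin", "vectors.txt")` would be the intended usage. Reading the floats with an explicit byte order rather than the platform default keeps the result the same across operating systems.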

I was thinking of picking this up in the coming week, hence the question.

ghost avatar Mar 30 '21 12:03 ghost

I am also interested in getting this done, but without a clear indication of what exactly the format is, I don't see how we can do it.

dirkgr avatar Apr 05 '21 21:04 dirkgr