
ConceptNetNumberbatch word embeddings support

Open zgornel opened this issue 7 years ago • 8 comments

This pull adds support for ConceptNetNumberbatch. Three distinct file formats are available and supported:

  • [x] multilingual, gzipped .txt file, word and embeddings on each line
  • [x] English, gzipped .txt file, word and embeddings on each line
  • [x] multilingual, HDF5, each embedding is a Vector{Int8}
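The gzipped .txt formats above can be sketched roughly as follows (a hedged illustration, not the PR's actual implementation; `read_txt_embeddings` is a hypothetical name and CodecZlib is an assumed dependency for decompression):

```julia
using CodecZlib   # assumed dependency for gzip decompression

function read_txt_embeddings(path::AbstractString)
    words = String[]
    vectors = Vector{Float64}[]
    io = GzipDecompressorStream(open(path))
    try
        for line in eachline(io)
            parts = split(line)
            length(parts) < 2 && continue   # skip header / blank lines
            push!(words, String(parts[1]))
            push!(vectors, parse.(Float64, parts[2:end]))
        end
    finally
        close(io)
    end
    return words, vectors
end
```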

ConceptNet word keys for the multilingual datasets are of the form /c/<language>/word, which makes direct access a bit unwieldy: searching for, say, word alone fails. Misspellings, e.g. word. or wordd, fail as well. A more heuristic method of retrieving the best match would be advisable at this point :)
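One possible heuristic, sketched here purely for illustration (the function names are hypothetical, not part of this PR): build the full key from language and word, and fall back to the vocabulary entry with the smallest edit distance when an exact lookup misses.

```julia
make_key(lang::AbstractString, word::AbstractString) = "/c/$lang/$word"

# Two-row Levenshtein distance; enough for a crude best-match heuristic.
function levenshtein(a::AbstractString, b::AbstractString)
    la, lb = length(a), length(b)
    prev = collect(0:lb)          # distances against the empty prefix of `a`
    curr = similar(prev)
    for (i, ca) in enumerate(a)
        curr[1] = i
        for (j, cb) in enumerate(b)
            cost = ca == cb ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution
        end
        prev, curr = curr, prev
    end
    return prev[lb + 1]
end

# Return the vocabulary entry closest to a (possibly misspelled) query.
best_match(word, vocab::Vector{String}) =
    vocab[argmin([levenshtein(word, w) for w in vocab])]
```

A linear scan like this is O(vocabulary size) per query, so a real implementation would want an index or a pre-filter, but it conveys the idea.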

zgornel avatar Sep 17 '18 19:09 zgornel

Thanks for this. It might be a little while before I can properly review this. Feel free to ping me if you think I have forgotten.

I think we might want to add more smarts to the return type. I think we can't get away with just returning a struct. We need some methods. We'd also need this for interpolation of OOV words in #1.

I do not like the use of :compressed as a language. I think that should also use a multilingual marker.

It is also clear to me that as we add more embedding types, the need to parallelize them into separate testing environments grows.

oxinabox avatar Sep 18 '18 02:09 oxinabox

Thanks for the feedback. A few remarks:

  • No problem on the time issue; this branch is available anyway for quick use of ConceptNet. If more refinements are required, it can be merged later ;) Some bits have to be improved as well...
  • More methods are needed indeed; OOV interpolation is a must, I will look into that and backport anything worthwhile
  • Also, support for out-of-language lookups would be a nice feature, e.g. getting the embedding of an equivalent word from another language; no idea how feasible this is. fasttext has language detection, so at least OOV words can be correctly interpolated
  • :compressed is a quick hack
  • Parallelization is also nice to have; however, I see as more problematic the fact that the tests download many GBs into a temporary folder. For me it was quite difficult and I had to remove existing tests. It would be good if mini-datasets, e.g. ~15 embeddings, were used; for ConceptNet I have crafted such datasets specifically for testing...

zgornel avatar Sep 18 '18 08:09 zgornel

Parallelization is also nice to have; however, I see as more problematic the fact that the tests download many GBs into a temporary folder;

I am not sure what you mean; the tests delete their downloads automatically. If they are not, please raise a separate issue. (I think there might be a Windows-related bug, but I am not sure, as I don't use Windows.)

For me it was quite difficult and I had to remove existing tests. It would be good if mini-datasets, e.g. ~15 embeddings, were used; for ConceptNet I have crafted such datasets specifically for testing...

Yes, ideally we would just test on mini-datasets. That does have the downside of them not being real, but it is likely worth it for the faster tests. We'll find out pretty quickly if they do not match up to the real formats. Feel free to make such a PR.

oxinabox avatar Sep 18 '18 09:09 oxinabox

I am not sure what you mean; the tests delete their downloads automatically. If they are not, please raise a separate issue. (I think there might be a Windows-related bug, but I am not sure, as I don't use Windows.)

On many Unix-like systems /tmp is mounted via tmpfs in RAM. On lower-memory machines this can be an issue, as RAM fills up with the tests' data. So, even though it is an elegant approach, it has its issues, and IMHO tests should be able to run on low-memory machines (this can cause problems with CI as well, since those machines, at least the free ones, have few resources at their disposal).
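One hedged way around the tmpfs issue (the `EMBEDDINGS_TEST_DIR` variable is hypothetical, named here only for illustration): Julia's `mktempdir` accepts a parent directory, so the tests could place their downloads on disk-backed storage such as /var/tmp instead of the default `tempdir()`, which often maps to a RAM-backed /tmp.

```julia
# Choose a disk-backed parent; /var/tmp is typically not tmpfs on Linux.
parent = get(ENV, "EMBEDDINGS_TEST_DIR", "/var/tmp")  # hypothetical override
mktempdir(parent) do dir
    # download and test against the dataset inside `dir`;
    # the directory is removed automatically when this block exits
end
```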

zgornel avatar Sep 18 '18 09:09 zgornel

@zgornel Thanks! I'll start taking a look at this too.

fasttext has language detection so at least OOV words can be correctly interpolated

Can you clarify what you mean by this? My understanding is that while it's quite possible to use the fasttext library to train a classifier for a language identification task (like they show here), the pretrained fasttext embeddings themselves are all monolingual, i.e. each language is trained separately and the embedding space is not shared among languages, with any OOV interpolation also being language-specific, as it is computed from subword character n-grams. Maybe I'm missing your point, though.
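For context, the subword-based OOV interpolation mentioned above works roughly like this (a hedged sketch of the fastText idea, not its actual implementation; `ngram_vectors` is a hypothetical lookup from n-gram to trained vector):

```julia
# An unseen word's vector is the average of its character n-gram vectors.
function char_ngrams(word::AbstractString; nmin = 3, nmax = 6)
    w = "<" * word * ">"   # fastText-style word boundary markers
    # byte indexing is used here; fine for this ASCII illustration
    [w[i:i+n-1] for n in nmin:nmax for i in 1:length(w)-n+1]
end

function oov_vector(word, ngram_vectors)
    grams = filter(g -> haskey(ngram_vectors, g), char_ngrams(word))
    isempty(grams) && return nothing   # no subword information available
    sum(ngram_vectors[g] for g in grams) / length(grams)
end
```

Since the n-grams come from one language's training run, the interpolated vector is only meaningful within that language's embedding space.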

I think we might want to add more smarts to the return type. I think we can't get away with just returning a struct. We need some methods. We'd also need this for interpolation of OOV words in #1.

I agree. But to me, it seems like this is a separate feature that this PR doesn't (necessarily) depend on. @oxinabox do you have anything specific in mind already? Otherwise, maybe we should open another issue to discuss what a generic API might look like.

dellison avatar Sep 19 '18 04:09 dellison

@zgornel Thanks! I'll start taking a look at this too.

fasttext has language detection so at least OOV words can be correctly interpolated

Can you clarify what you mean by this? My understanding is that while it's quite possible to use the fasttext library to train a classifier for a language identification task (like they show here), ...

That is my understanding too. Languages.jl already has (its own) language detection (probably not state of the art, but very workable), and once #6 is done, it will be easy to use together with it.

I think we might want to add more smarts to the return type. I think we can't get away with just returning a struct. We need some methods. We'd also need this for interpolation of OOV words in #1.

I agree. But to me, it seems like this is a separate feature that this PR doesn't (necessarily) depend on. @oxinabox do you have anything specific in mind already? Otherwise, maybe we should open another issue to discuss what a generic API might look like.

Yes, let's. #14 #16

oxinabox avatar Sep 19 '18 05:09 oxinabox

@oxinabox I think that issue link just goes to this PR. Issue is #16

dellison avatar Sep 19 '18 07:09 dellison

fasttext has language detection so at least OOV words can be correctly interpolated

Can you clarify what you mean by this? My understanding is that while it's quite possible to use the fasttext library to train a classifier for a language identification task (like they show here), the pretrained fasttext embeddings themselves are all monolingual, i.e. each language is trained separately and the embedding space is not shared among languages, with any OOV interpolation also being language-specific, as it is computed from subword character n-grams. Maybe I'm missing your point, though.

That's it, I was referring to the pretrained model, which can be downloaded here. Since the multilingual ConceptNet file uses keys of the form /c/<language>/<matchable expression>, detecting the language first can help pre-filter the embeddings before the OOV search. In cases where the same word appears in several languages it can help greatly, and it speeds up the search as well.
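The pre-filtering step is straightforward given the key format; a hedged sketch (function names are illustrative only):

```julia
# Keep only the vocabulary slice for one language, exploiting the
# "/c/<language>/<word>" key structure of the multilingual ConceptNet file.
lang_prefix(lang::AbstractString) = "/c/$lang/"

function candidates(vocab::Vector{String}, lang::AbstractString)
    p = lang_prefix(lang)
    filter(k -> startswith(k, p), vocab)
end
```

For example, `candidates(vocab, "en")` keeps only the English keys, so any fuzzy best-match or OOV interpolation afterwards runs over a much smaller set.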

@oxinabox I was not aware that Languages.jl has language identification, that's great.

zgornel avatar Sep 19 '18 08:09 zgornel