
expand embeddings with a smaller dataset

Open jacopofar opened this issue 8 years ago • 3 comments

Hello, thanks for the software and the datasets, very interesting!

I'm curious to know whether there are tools (or papers) that deal with the problem of expanding a huge collection of word embeddings, like the ones available on this project's download page, with smaller ones built from a project-specific corpus (such as a mailbox or a collection of articles on a specific topic), assuming they have the same dimensionality.

It could be useful to reuse the computational effort that went into generating those datasets and add vectors for a specific jargon or use case.

jacopofar avatar Jun 14 '16 15:06 jacopofar

Yeah, this is a good question. I would recommend first merging the two vocabularies together to the extent that it is possible. In the new joint vocabulary, set the word vector of each word to the concatenation [u v], where u is the word's vector from corpus 1 and v is its vector from corpus 2. If either is absent, set it to some randomly initialized vector, or perhaps the `<unk>` vector. The downstream classifier (e.g. a random forest) can then make use of information from both sources. Does that make sense?

ghost avatar Jun 23 '16 19:06 ghost

I'm not sure I understood. Let's say I use the huge word-vector file from the download page, built on the Common Crawl corpus. Then I use the program to build my own vector file from a much smaller corpus on a narrow topic, e.g. Ferrari cars.

Now I expect this smaller corpus to contain many topic-specific terms not present in the bigger vector file (e.g. "California T"), so I want to use the common ones ("car", "engine", "the", etc.) to recalculate the vectors of these topic-specific words and place them at the correct position in term space.

This way I could have a vector file suited to a specific use case, combining the great number of words in the Common Crawl file with the field-specific terms from my smaller dataset, and the relationships between vectors and meanings would hold across the merged file (so "California T" would end up with a vector close to those of other car models already in Common Crawl).

I was thinking of applying logistic regression to the common terms to calculate the position of the unknown ones, but I imagine this kind of problem has already been studied as a way to expand a vector file incrementally.
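The regression idea above can be sketched with NumPy. This is a minimal illustration, not an implementation from this repo: it uses ordinary least squares (rather than logistic regression, since the targets here are continuous vectors) to fit a linear map from the small corpus's space into the big corpus's space on the shared words, then projects the unseen words through it. The function names and the dictionary-of-arrays representation are my own assumptions.

```python
import numpy as np

def fit_alignment(small_vecs, big_vecs, shared_words):
    """Fit a least-squares linear map W taking vectors from the
    small corpus's space into the big corpus's space, using words
    that appear in both vocabularies."""
    X = np.stack([small_vecs[w] for w in shared_words])  # (n, d_small)
    Y = np.stack([big_vecs[w] for w in shared_words])    # (n, d_big)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)            # solve X @ W ≈ Y
    return W

def project_new_words(small_vecs, big_vecs, W):
    """Copy the big vocabulary as-is, and map only the words it
    lacks into its space via W."""
    merged = dict(big_vecs)
    for w, v in small_vecs.items():
        if w not in merged:
            merged[w] = v @ W
    return merged
```

On real data the map is only approximate, so a topic-specific word like "California T" lands near, not exactly at, its "true" position relative to the Common Crawl vectors.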

jacopofar avatar Jul 04 '16 09:07 jacopofar

When you say:

to recalculate the vectors of these topic-specific words

note that this is different from my recommendation. You can't simply recalculate the old vectors for those terms without losing the original information. My recommendation is this: if you have two different corpora, with word vectors U and V respectively, you first compute some canonical joint vocabulary. In this joint vocabulary, each word w_i will have either a word vector u_i in U or v_i in V, but not necessarily both. This means you will have "holes" in the system:

word:  w_i    w_{i+1}   w_{i+2}
U:     u_i    u_{i+1}   -
V:     -      v_{i+1}   v_{i+2}

In this case, word i has a vector in U but not in V. Word i+1 has a vector in both corpora. And word i+2 has a vector only in V.

I'm simply suggesting that you fill in every "-" with a consistent random vector, and then concatenate u_i and v_i to create a new word vector. Thus if every u is 300-D, and every v is 200-D, the result will be 500-D. This all assumes you're trying to merge vectors for the purpose of some downstream classifier, such as a Random Forest.
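A minimal sketch of this fill-and-concatenate scheme, under my own assumptions about representation (plain dicts of NumPy arrays): the "-" holes are filled with a random vector, made consistent across runs by seeding the RNG from a hash of the word itself, which is one possible reading of "consistent random vector", and each word's two halves are then concatenated.

```python
import hashlib
import numpy as np

def consistent_random_vector(word, dim, tag):
    """Deterministic filler: the same (word, side) pair always
    yields the same vector, via an RNG seeded from a hash."""
    seed = int.from_bytes(
        hashlib.sha256(f"{tag}:{word}".encode()).digest()[:8], "big")
    return np.random.default_rng(seed).normal(size=dim)

def concat_merge(U, V, dim_u, dim_v):
    """Build the joint vocabulary and give each word the vector
    [u_w  v_w], filling holes on either side."""
    merged = {}
    for w in set(U) | set(V):
        u = U.get(w)
        if u is None:
            u = consistent_random_vector(w, dim_u, "U")
        v = V.get(w)
        if v is None:
            v = consistent_random_vector(w, dim_v, "V")
        merged[w] = np.concatenate([u, v])  # dim_u + dim_v, e.g. 300 + 200 = 500
    return merged
```

Note that this preserves both sources intact, which is the point of the recommendation: nothing from U or V is overwritten, and the downstream classifier decides how to weight the two halves.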

ghost avatar Jul 09 '16 17:07 ghost