GloVe
expand embeddings with a smaller dataset
Hello, thanks for the software and the datasets, very interesting!
I'm curious to know whether there are tools (or papers) dealing with the problem of expanding a huge collection of word embeddings, like the ones available on this project's download page, with smaller ones built from a smaller, project-specific corpus (like a mailbox or a collection of articles on a specific topic), assuming they have the same dimensionality.
It could be useful to reuse the computational effort that went into generating those datasets and add vectors for a specific jargon or use case.
Yeah, this is a good question. I would recommend first merging the two vocabularies together to the extent that it is possible. In the new joint vocabulary, set the word vector of each word to be [u v] by concatenation, where u is the word vector from corpus 1 and v is the word vector from corpus 2. If one of these is absent, set it to some randomly initialized vector.
I'm not sure if I understood. Let's say I use the huge word vector file on the download page, built on the Common Crawl corpus. Then I use the program to build my own vector file from a much smaller corpus on a narrow topic, e.g. Ferrari cars.
Now I expect this smaller corpus to contain many topic-specific terms not present in the bigger vector file (e.g. "California T"), so I want to use the common ones ("car", "engine", "the", etc.) to recalculate the vectors of these topic-specific words and place them in the correct position in term space.
This way I could have a vector file to apply to a specific use case, containing both the great number of words in the Common Crawl file and the field-specific terms from my smaller dataset, and the relationships between vectors and meanings would hold across the merged file (so "California T" would end up with a vector close to the other car models already in Common Crawl).
I was thinking about applying logistic regression to the common terms to calculate the new positions of the unknown ones, but I imagine this kind of problem has already been studied as a way to expand the vector file incrementally.
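For reference, a minimal sketch of the kind of mapping described above, using ordinary least squares (rather than logistic regression, which is a classifier) to fit a linear map from the small-corpus space into the Common Crawl space over the shared vocabulary, then projecting the topic-specific words. The dict names (`big_vecs`, `small_vecs`) and the plain-numpy approach are illustrative assumptions, not anything provided by GloVe itself:

```python
import numpy as np

def project_new_words(big_vecs, small_vecs):
    """Map words that only exist in the small corpus into the big-corpus space.

    big_vecs, small_vecs: hypothetical {word: np.ndarray} dicts loaded from
    the two vector files (dimensions may differ between the two spaces).
    """
    shared = [w for w in small_vecs if w in big_vecs]
    new = [w for w in small_vecs if w not in big_vecs]

    # Fit W minimizing ||X_small @ W - X_big||^2 over the shared words.
    X_small = np.vstack([small_vecs[w] for w in shared])
    X_big = np.vstack([big_vecs[w] for w in shared])
    W, *_ = np.linalg.lstsq(X_small, X_big, rcond=None)

    # Project the topic-specific words and merge them into the big vocabulary.
    merged = dict(big_vecs)
    for w in new:
        merged[w] = small_vecs[w] @ W
    return merged
```

With enough shared words, "California T" would land near the existing car-model vectors in the Common Crawl space; how well the linear map preserves meaning is an empirical question.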
When you say:
to recalculate the vectors of these topic-specific words
note that this is different from my recommendation. You can't simply recalculate the old vectors for those terms without losing the original information. My recommendation is that if you have two different corpuses, with word vectors U and V respectively, you first calculate some canonical joint vocabulary. Inside this new joint vocabulary, each word w_i will have either a word vector u_i in U or v_i in V, but not necessarily both. This means you will have "holes" in the system:
word:  i      i+1      i+2
U:     u_i    u_(i+1)  -
V:     -      v_(i+1)  v_(i+2)
In this case, word i has a vector in U, but not in V. Word i+1 has a vector in both corpuses. And word i+2 has a vector only in V.
I'm simply suggesting that you fill in every "-" with a consistent random vector, and then concatenate u_i and v_i to create a new word vector. Thus if every u is 300-D, and every v is 200-D, the result will be 500-D. This all assumes you're trying to merge vectors for the purpose of some downstream classifier, such as a Random Forest.
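For concreteness, a minimal sketch of this concatenation scheme, assuming both vector files have been loaded into hypothetical {word: np.ndarray} dicts `U` (300-D) and `V` (200-D); using a single shared fill vector per side is one way to read "consistent random vector":

```python
import numpy as np

rng = np.random.default_rng(0)

def merge_by_concatenation(U, V, dim_u=300, dim_v=200):
    """Build [u v] vectors over the joint vocabulary, filling holes consistently."""
    vocab = set(U) | set(V)                       # canonical joint vocabulary
    fill_u = rng.normal(scale=0.1, size=dim_u)    # used whenever a word is missing from U
    fill_v = rng.normal(scale=0.1, size=dim_v)    # used whenever a word is missing from V
    merged = {}
    for w in vocab:
        u = U.get(w, fill_u)
        v = V.get(w, fill_v)
        merged[w] = np.concatenate([u, v])        # 500-D result
    return merged
```

The 500-D merged vectors can then be fed directly to the downstream classifier, e.g. a Random Forest.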