
Combining other types of embeddings

mbarbouch opened this issue on Oct 22 '20 · 7 comments

Hi Louis,

I want to combine embeddings extracted from a different source with 1) VGCN-BERT embeddings and 2) plain BERT embeddings.

For this, I modified the forward() function in the VGCNBertEmbeddings class and simply added the extra embeddings. For case 1:

`embeddings = gcn_words_embeddings + position_embeddings + token_type_embeddings + other_embeddings`

and for case 2:

`embeddings = words_embeddings + position_embeddings + token_type_embeddings + other_embeddings`

Is this the right way? Are there other parts to take into account?
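To be concrete, here is a toy sketch of the kind of change I mean (made-up tensor shapes and a hypothetical `other_embeddings` tensor, not the real `forward()` signature):

```python
import torch

# Toy shapes, not the real model: 2 sentences of 8 tokens, hidden size 768.
batch_size, seq_len, hidden = 2, 8, 768

gcn_words_embeddings = torch.randn(batch_size, seq_len, hidden)
position_embeddings = torch.randn(batch_size, seq_len, hidden)
token_type_embeddings = torch.randn(batch_size, seq_len, hidden)

# Hypothetical embeddings from the other source; they have to be broadcastable
# to [batch, seq_len, hidden] for the element-wise sum to make sense.
other_embeddings = torch.randn(batch_size, seq_len, hidden)

embeddings = gcn_words_embeddings + position_embeddings + token_type_embeddings + other_embeddings
print(embeddings.shape)  # torch.Size([2, 8, 768])
# In BERT-style embedding classes this sum is then followed by LayerNorm and dropout.
```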

Thanks in advance.

mbarbouch · Oct 22 '20

Hi,

It seems it's at least one of the right ways.

Louis-udm · Oct 23 '20

Thanks for your reply!

I could run the model with that addition, but it didn't make much difference in the final score.

Now I am trying to adjust BERT's embeddings the way you did:

```python
gcn_vocab_out = self.vocab_gcn(vocab_adj_list, vocab_input)

gcn_words_embeddings = words_embeddings.clone()
for i in range(self.gcn_embedding_dim):
    tmp_pos = (attention_mask.sum(-1) - 2 - self.gcn_embedding_dim + 1 + i) + torch.arange(0, input_ids.shape[0]).to(input_ids.device) * input_ids.shape[1]
    gcn_words_embeddings.flatten(start_dim=0, end_dim=1)[tmp_pos, :] = gcn_vocab_out[:, :, i]
```

However, some parts are unclear to me:

  1. The GCN dimension is 16, while BERT's is 768. How do you project the smaller GCN vectors onto the word embeddings?
  2. The number of words can differ per sentence. How do you know which word embeddings need to be adjusted if the GCN size is always 16?

I have yet another question, about word coverage from external sources. If a word doesn't occur in your corpus, we can't calculate the embedding for that word. Do you have any ideas on how to cope with that limitation? (I was thinking of averaging the word embeddings that are found and turning them into one single sentence embedding, but I don't know how to relate that single vector to the multiple word embedding vectors of BERT!)

mbarbouch · Oct 23 '20

Hi,

Were you able to combine the VGCN embedding with BERT? How do you load the GCN embedding model to begin with?

Thanks!

nargesam · Dec 01 '20

> Hi,
>
> Were you able to combine the VGCN embedding with BERT? How do you load the GCN embedding model to begin with?
>
> Thanks!

Hi Nargesam,

Well, by default the model makes use of VGCN embeddings. These are combined in https://github.com/Louis-udm/VGCN-BERT/blob/e5f642ab8a53478d3fa52dc7c8d7f91f7e62055e/model_vgcn_bert.py#L179. That is also where I put my own embeddings.

mbarbouch · Dec 02 '20

> Thanks for your reply!
>
> I could run the model with that addition, but it didn't make much difference in the final score.
>
> Now I am trying to adjust BERT's embeddings the way you did:
>
> ```python
> gcn_vocab_out = self.vocab_gcn(vocab_adj_list, vocab_input)
>
> gcn_words_embeddings = words_embeddings.clone()
> for i in range(self.gcn_embedding_dim):
>     tmp_pos = (attention_mask.sum(-1) - 2 - self.gcn_embedding_dim + 1 + i) + torch.arange(0, input_ids.shape[0]).to(input_ids.device) * input_ids.shape[1]
>     gcn_words_embeddings.flatten(start_dim=0, end_dim=1)[tmp_pos, :] = gcn_vocab_out[:, :, i]
> ```
>
> However, some parts are unclear to me:
>
> 1. The GCN dimension is 16, while BERT's is 768. How do you project the smaller GCN vectors onto the word embeddings?
> 2. The number of words can differ per sentence. How do you know which word embeddings need to be adjusted if the GCN size is always 16?

Hi, sorry for the late reply. The GCN dimension is also 768, not 16. The 16 is a hyperparameter that means 16 GCN-related words. A GCN-related word can be seen as an integrated word, i.e. a combination of words from the graph with different weights.
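To make the shapes concrete, here is a toy sketch (made-up sizes, not the repo's exact code) of what that loop does: each of the 16 graph words is a full 768-dimensional vector, and, as far as I can tell, the loop writes them into the block of 16 token slots just before [SEP]:

```python
import torch

# Toy sketch with made-up sizes: batch of 2 sequences of 32 tokens,
# hidden size 768, gcn_embedding_dim = 16.
batch_size, seq_len, hidden, gcn_embedding_dim = 2, 32, 768, 16

words_embeddings = torch.randn(batch_size, seq_len, hidden)          # BERT token embeddings
gcn_vocab_out = torch.randn(batch_size, hidden, gcn_embedding_dim)   # 16 graph "words", each 768-d
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)   # pretend there is no padding

gcn_words_embeddings = words_embeddings.clone()
for i in range(gcn_embedding_dim):
    # Flattened (batch * seq_len) indices of the i-th slot in the block of
    # gcn_embedding_dim token positions that ends just before [SEP].
    tmp_pos = (attention_mask.sum(-1) - 2 - gcn_embedding_dim + 1 + i) \
              + torch.arange(0, batch_size) * seq_len
    gcn_words_embeddings.flatten(start_dim=0, end_dim=1)[tmp_pos, :] = gcn_vocab_out[:, :, i]

print(gcn_words_embeddings.shape)  # torch.Size([2, 32, 768]) – nothing is ever 16-dimensional
```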

> I have yet another question, about word coverage from external sources. If a word doesn't occur in your corpus, we can't calculate the embedding for that word. Do you have any ideas on how to cope with that limitation? (I was thinking of averaging the word embeddings that are found and turning them into one single sentence embedding, but I don't know how to relate that single vector to the multiple word embedding vectors of BERT!)

That mechanism is already in the BERT source: BERT splits a new word into a head token and some ##xxx subword tokens. Of course, you can add your new words to the original vocabulary, but then I think you will have to pre-train first.
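As an illustration of that splitting behaviour (using the Hugging Face `transformers` tokenizer here, which may differ slightly from the tokenizer bundled with this repo):

```python
# Sketch only: show how WordPiece covers a word that is not in the vocabulary.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("electroencephalography"))
# Prints something like ['electro', '##ence', ...] (exact pieces depend on the vocab):
# an out-of-vocabulary word becomes a head token plus ##-prefixed subword pieces,
# so every word still gets (sub)token embeddings without extending the vocabulary.
```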

Louis-udm · Jan 07 '21


Hi Louis,

Great Repo.

I was just wondering how you came up with the adjacency matrix calculation in the data preparation file, i.e. PMI and tf-idf. Are there any other word relations we can add apart from these two? How did you arrive at these two? Any theory? Any resources?

Thank you!

jaytimbadia · Jan 23 '21

@jaytimbadia Thank you for your attention. Both the GCN adjacency matrix and PMI are based on previous works; you can find many GCN-related papers and PMI papers. You can try other word relationships, such as WordNet. These are all mentioned in my paper.
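As a rough sketch of the sliding-window PMI statistic this family of graph constructions uses (a generic TextGCN-style formulation, not necessarily the exact code in the data preparation script):

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(docs, window_size=20):
    """Sliding-window PMI between word pairs (generic TextGCN-style sketch).

    docs: list of tokenized documents (lists of words).
    Returns {(w1, w2): pmi} for pairs with positive PMI.
    """
    windows = []
    for doc in docs:
        if len(doc) <= window_size:
            windows.append(doc)
        else:
            windows.extend(doc[i:i + window_size] for i in range(len(doc) - window_size + 1))

    word_count = Counter()   # number of windows containing each word
    pair_count = Counter()   # number of windows containing each word pair
    for w in windows:
        uniq = sorted(set(w))
        word_count.update(uniq)
        pair_count.update(combinations(uniq, 2))

    n = len(windows)
    edges = {}
    for (w1, w2), c in pair_count.items():
        pmi = math.log((c / n) / ((word_count[w1] / n) * (word_count[w2] / n)))
        if pmi > 0:          # only positive-PMI pairs become graph edges
            edges[(w1, w2)] = pmi
    return edges

docs = [["graph", "convolution"], ["graph", "convolution"], ["text", "classification"]]
print(pmi_edges(docs))
# e.g. {('convolution', 'graph'): 0.405..., ('classification', 'text'): 1.098...}
```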

Louis-udm · Jan 23 '21