Text-GCN

Node label one hot encoded

Open matteomedioli opened this issue 4 years ago • 2 comments

Hi, I'm working with WordNet and graph neural networks. How is it possible to encode a complete vocabulary in one-hot format? For example, I have 250k different words. How many words do you use for your model? Thanks in advance!

matteomedioli avatar Feb 23 '21 21:02 matteomedioli

A one-hot init on the complete vocabulary would be the identity matrix over all words in your vocabulary, so 250k nodes. The function that does this in the code is init_node_feats. The number of words for the datasets in the repo can be seen in data/corpus by looking at the {dataset}_vocab.txt files. For example, the r8_presplit dataset has 7688 nodes. This vocabulary is built from the sentences, keeping only words with frequency greater than 5; see the build_text_graph_dataset function. However, note that 250k nodes with one-hot initialization would not be feasible: each node would have an initial feature dimension of 250k, and the graph could not be loaded into system memory.
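To make the scale concrete, here is a minimal sketch (not the repo's actual init_node_feats implementation) of a one-hot init as an identity matrix, plus the back-of-the-envelope memory cost of a dense 250k-word identity:

```python
import numpy as np
import scipy.sparse as sp

def one_hot_init_sketch(num_nodes):
    # One-hot features for every node == the identity matrix,
    # stored sparse so only the diagonal is materialized.
    return sp.identity(num_nodes, format="csr", dtype=np.float32)

# A *dense* float32 identity for 250k words:
dense_bytes = 250_000 ** 2 * 4
print(dense_bytes / 1e9)  # 250.0 (GB) -- far beyond typical system memory

# The r8_presplit vocabulary size mentioned above is manageable:
feats = one_hot_init_sketch(7688)
print(feats.shape)  # (7688, 7688)
```

The sparse representation only defers the problem: the first dense layer of the GCN would still multiply against a 250k-dimensional input, so in practice large vocabularies need pruning (e.g. the frequency > 5 filter) or learned embeddings instead of one-hot features.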

codeKgu avatar Feb 24 '21 19:02 codeKgu

Let's assume I have 5000 documents with their 5000 integer labels, and this corpus contains 14000 unique words. According to the paper, the total number of nodes will be total documents + vocab size = 5000 + 14000 = 19000 nodes. For documents we know the labels, but how are you creating the labels for the vocabulary word nodes? Can you clarify this?
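For context on the arithmetic above: a common pattern in Text-GCN-style implementations (an assumption here, not necessarily what this repo does) is that word nodes carry no labels at all, and a boolean mask restricts the training loss to document nodes. A hypothetical sketch with the numbers from this question:

```python
import numpy as np

num_docs, vocab_size = 5000, 14000
num_nodes = num_docs + vocab_size  # 19000, assuming documents are ordered first

# Word nodes get a dummy label; only document labels are real.
labels = np.full(num_nodes, -1, dtype=np.int64)
labels[:num_docs] = np.random.randint(0, 8, num_docs)  # e.g. 8 classes

# The loss is computed only where the mask is True.
train_mask = np.zeros(num_nodes, dtype=bool)
train_mask[:num_docs] = True
```

Under this scheme the word nodes still participate in message passing, so their representations are learned, but they never contribute to the supervised loss.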

riyaj8888 avatar Jul 25 '22 07:07 riyaj8888