HGAT
How to use unlabeled data
The paper says "all the left documents are for testing, which are also used as unlabeled documents during training", but I didn't find an explanation of how the unlabeled data are actually used. Could anyone make sense of it?
From model/code/train.py:
We can see that during training, the features of all nodes are used, including the documents for training, validation, and testing. However, the loss is computed only over the training and validation documents. Therefore, the test documents also serve as unlabeled documents during training (as document nodes in the graph).
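This transductive setup can be sketched as follows; the names below are illustrative, not the actual HGAT code. Every document node goes through the forward pass, but the loss mask covers only the labeled (train/validation) indices:

```python
import math

def masked_cross_entropy(predictions, labels, labeled_idx):
    """Average negative log-likelihood over labeled nodes only.

    predictions: per-node class-probability lists. ALL documents,
    including test nodes, went through the graph layers to produce
    these, so test documents still shape the representations.
    labeled_idx: indices of train/validation documents. Test nodes
    are simply absent from this mask and never enter the loss.
    """
    total = 0.0
    for i in labeled_idx:
        total += -math.log(predictions[i][labels[i]])
    return total / len(labeled_idx)

# toy check: three documents, only the first two are labeled
preds = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]   # third row = test doc
labels = [0, 1, 0]
loss = masked_cross_entropy(preds, labels, labeled_idx=[0, 1])
```

The third document contributes nothing to `loss`, yet in the real model its node would still have influenced `preds` through the graph attention layers, which is exactly how the test documents act as unlabeled data.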
But if we really need to include unlabeled documents that won't be used for testing either, we need to change the code in model/code/utils.py and perhaps build_data.py (if we use this file to create our graph).
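A minimal sketch of what such a change would amount to (the function and argument names are hypothetical, not the repository's actual API): the extra documents are appended as feature rows so they participate in graph propagation, while the set of labeled indices stays untouched, so they never enter the loss or the test evaluation.

```python
def add_unlabeled_docs(features, labeled_idx, extra_features):
    """Append purely unlabeled document rows to the node features.

    The labeled index set is returned unchanged: the new nodes add
    structure to the graph but contribute to neither the loss nor
    the test metrics.
    """
    return features + extra_features, labeled_idx

# toy example: two labeled documents, one unlabeled document appended
feats, labeled = add_unlabeled_docs([[1.0], [0.5]], [0, 1], [[0.2]])
```

In the real code, edges between the new document nodes and the existing topic/entity nodes would also have to be added when the adjacency matrices are built.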
Hello, I would like to ask whether you built the graph yourself from a dataset you found. After building the graph from my own dataset, my results are much worse than those in the paper.
Which dataset are you using? The Agnews dataset worked for us; we even got a better F-score. You need careful preprocessing so that TagMe works properly. What is your train_per_class value? Also sanity-check your numbers of nodes and edges.
Both agnews and ohsumed reach only half of the F1 value reported in the paper. For agnews, only with train_per_class = 400 does the result barely reach 70. Are you using the dataset without any special processing? I have not found which error causes the low F1 value.
@bp20200202 I just remembered that we also made the mistake of using only the Titles instead of the Descriptions for the Agnews dataset. Maybe you are also using the titles only.
Here is one simple preprocessing example that we used:
import re
import gensim

# remove text inside "()" and collapse the resulting double spaces
def remove_bracket(text):
    return re.sub(' +', ' ', re.sub(r'\([^)]*\)', '', text))

# remove bracketed text, then tokenize and lowercase
def preprocess(text):
    text = remove_bracket(text)
    tokens = gensim.utils.simple_preprocess(text)
    return " ".join(tokens)
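For readers without gensim installed, here is a standard-library sketch of roughly what that pipeline does (gensim's `simple_preprocess` lowercases and keeps alphabetic tokens between 2 and 15 characters; the regex stand-in below approximates that and is not a drop-in replacement):

```python
import re

def remove_bracket(text):
    # remove text inside "()" and collapse the resulting double spaces
    return re.sub(' +', ' ', re.sub(r'\([^)]*\)', '', text))

def simple_tokenize(text, min_len=2, max_len=15):
    # stdlib approximation of gensim.utils.simple_preprocess:
    # lowercase, keep alphabetic tokens within a length range
    return [t for t in re.findall(r'[a-zA-Z]+', text.lower())
            if min_len <= len(t) <= max_len]

def preprocess(text):
    return " ".join(simple_tokenize(remove_bracket(text)))

cleaned = preprocess("Wall St. Bears (Reuters) Claw Back Into the Black")
# -> "wall st bears claw back into the black"
```

Note that neither version removes stop words; if TagMe relies on natural phrasing, keeping stop words in the text it sees may actually help entity linking.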
Try this dataset: agnews_with_document_preprocessed_3200.txt agnews_with_document_preprocessed.txt
Thanks a lot, I will try it again.