HGAT
How to use unlabeled data
The paper says "all the left documents are for testing, which are also used as unlabeled documents during training", but I didn't find an explanation of how the unlabeled data are actually used. Could anyone make sense of it?
From model/code/train.py:
We can see that during training, the features of all nodes are used, including the documents for training, validation, and testing. However, the loss is computed only over the training and validation documents. Therefore, the test documents also serve as unlabeled documents during training (as document nodes in the graph).
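This transductive setup can be sketched as follows; the names below are illustrative, not the actual HGAT code. Every document node goes through the forward pass, but the loss mask covers only the labeled (train/validation) indices:

```python
import math

def masked_cross_entropy(predictions, labels, labeled_idx):
    """Average negative log-likelihood over labeled nodes only.

    predictions: per-node class-probability lists. ALL documents,
    including test nodes, went through the graph layers to produce
    these, so test documents still shape the representations.
    labeled_idx: indices of train/validation documents. Test nodes
    are simply absent from this mask and never enter the loss.
    """
    total = 0.0
    for i in labeled_idx:
        total += -math.log(predictions[i][labels[i]])
    return total / len(labeled_idx)

# toy check: three documents, only the first two are labeled
preds = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]   # third row = test doc
labels = [0, 1, 0]
loss = masked_cross_entropy(preds, labels, labeled_idx=[0, 1])
```

The third document contributes nothing to `loss`, yet in the real model its node would still have influenced `preds` through the graph attention layers, which is exactly how the test documents act as unlabeled data.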
But if we really need to include unlabeled documents that won't be used for testing either, we need to change the code in model/code/utils.py and perhaps build_data.py (if we use this file to create our graph).
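A minimal sketch of what such a change would amount to (the function and argument names are hypothetical, not the repository's actual API): the extra documents are appended as feature rows so they participate in graph propagation, while the set of labeled indices stays untouched, so they never enter the loss or the test evaluation.

```python
def add_unlabeled_docs(features, labeled_idx, extra_features):
    """Append purely unlabeled document rows to the node features.

    The labeled index set is returned unchanged: the new nodes add
    structure to the graph but contribute to neither the loss nor
    the test metrics.
    """
    return features + extra_features, labeled_idx

# toy example: two labeled documents, one unlabeled document appended
feats, labeled = add_unlabeled_docs([[1.0], [0.5]], [0, 1], [[0.2]])
```

In the real code, edges between the new document nodes and the existing topic/entity nodes would also have to be added when the adjacency matrices are built.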
Hello, I would like to ask whether you built the graph yourself from a dataset you found. After building the graph from my own dataset, my results are much worse than those in the paper.
Which dataset are you using? The Agnews dataset worked for us; we even got a better F-score. You need careful preprocessing so that TagMe works properly. What is your train_per_class value? Also sanity-check your numbers of nodes and edges.
Both agnews and ohsumed reach only half of the F1 value reported in the paper. For agnews, only with train_per_class = 400 does the result barely reach 70. Are you using the dataset without any special processing? I have not found which error causes the low F1 value.
@bp20200202 I just remembered that we also made the mistake of using only the Titles instead of the Descriptions for the Agnews dataset. Maybe you are also using the titles only.
Here is one simple preprocessing example that we used:
import re
import gensim

# remove text inside "()" and collapse the resulting double spaces
def remove_bracket(text):
    return re.sub(' +', ' ', re.sub(r'\([^)]*\)', '', text))

# remove bracketed text, then tokenize and lowercase
def preprocess(text):
    text = remove_bracket(text)
    tokens = gensim.utils.simple_preprocess(text)
    return " ".join(tokens)
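For readers without gensim installed, here is a standard-library sketch of roughly what that pipeline does (gensim's `simple_preprocess` lowercases and keeps alphabetic tokens between 2 and 15 characters; the regex stand-in below approximates that and is not a drop-in replacement):

```python
import re

def remove_bracket(text):
    # remove text inside "()" and collapse the resulting double spaces
    return re.sub(' +', ' ', re.sub(r'\([^)]*\)', '', text))

def simple_tokenize(text, min_len=2, max_len=15):
    # stdlib approximation of gensim.utils.simple_preprocess:
    # lowercase, keep alphabetic tokens within a length range
    return [t for t in re.findall(r'[a-zA-Z]+', text.lower())
            if min_len <= len(t) <= max_len]

def preprocess(text):
    return " ".join(simple_tokenize(remove_bracket(text)))

cleaned = preprocess("Wall St. Bears (Reuters) Claw Back Into the Black")
# -> "wall st bears claw back into the black"
```

Note that neither version removes stop words; if TagMe relies on natural phrasing, keeping stop words in the text it sees may actually help entity linking.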
Try this dataset: agnews_with_document_preprocessed_3200.txt agnews_with_document_preprocessed.txt
Thanks a lot, I will try it again.