How to include node type with both labeled and unlabeled members
I have a use case for a single node type that has a mixture of labeled and unlabeled members. The docs sound like this is handled:
"These JSON files only need to list the IDs on its own set. For example, in a node classification task, there are 100 nodes and node ID starts from 0, and assume the last 50 nodes (ID from 49 to 99) have labels associated. ...."
I have parquet files for my nodes. During the graph construction routine, the creation of the final numpy arrays for labels seemed to indicate that I needed dummy labels for these nodes. How am I supposed to handle this use case where I want to keep to a single node type for this data?
Thank you!
Hi @robertreaney, is this a correct description of your data?
my_node_type.parquet
node_id,label
a,0
b,1
c,NaN
So some of your nodes have labels while others don't, and you want your train/validation/test sets to include only the nodes that have a label (a and b in this case)?
You could do this by creating custom split files, e.g. a train_ids.csv (and likewise for validation and test), with contents like
node_id
a
b
So for nodes that do not have labels, you would simply not list their ids in the custom split files you create.
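A minimal sketch of generating such a split file in Python, assuming the node table has already been loaded as (node_id, label) pairs with missing labels represented as None or NaN. The helper names, the `node_id` header, and the `train_ids.csv` file name are illustrative assumptions rather than anything gconstruct requires, so check the expected split-file layout against the GraphStorm docs. How the labeled IDs are divided between train, validation, and test is left out here.

```python
import csv
import math
import tempfile
from pathlib import Path

# Toy stand-in for the node table above; None/NaN marks an unlabeled node.
nodes = [("a", 0), ("b", 1), ("c", float("nan"))]

def is_missing(label):
    """Return True when a label is None or NaN."""
    return label is None or (isinstance(label, float) and math.isnan(label))

def labeled_ids(rows):
    """Return the IDs of rows that carry a real label, in input order."""
    return [node_id for node_id, label in rows if not is_missing(label)]

def write_id_file(path, ids):
    """Write one ID per line under a `node_id` header (assumed layout)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["node_id"])
        writer.writerows([node_id] for node_id in ids)

out_dir = Path(tempfile.mkdtemp())  # replace with your own output directory
train_path = out_dir / "train_ids.csv"

ids = labeled_ids(nodes)  # unlabeled node "c" is left out
write_id_file(train_path, ids)
```

Unlabeled nodes never appear in the file, so they stay in the graph without being used as training, validation, or test targets.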
@thvasilo this is correct. For clarity before closing this ticket: unlabeled nodes must still have a label value of a type compatible with the real labels. The correct way to provide unlabeled nodes to the gconstruct process is with a dummy label. I have used -1 in my codebase.
node_id,label
a,0
b,1
c,-1
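For anyone landing here later, a rough sketch of that dummy-label step in plain Python, assuming the node rows are (node_id, label) pairs and missing labels show up as None or NaN. `DUMMY_LABEL` and `fill_missing_labels` are hypothetical names for illustration, and -1 is simply the value used above, not a value known to be required by gconstruct.

```python
import math

DUMMY_LABEL = -1  # placeholder value from this thread; not a real class

# Toy stand-in for the node table; NaN marks the unlabeled node.
nodes = [("a", 0), ("b", 1), ("c", float("nan"))]

def fill_missing_labels(rows, dummy=DUMMY_LABEL):
    """Replace None/NaN labels with `dummy` and cast real labels to int."""
    filled = []
    for node_id, label in rows:
        missing = label is None or (
            isinstance(label, float) and math.isnan(label)
        )
        filled.append((node_id, dummy if missing else int(label)))
    return filled

filled_nodes = fill_missing_labels(nodes)
```

The dummy value only keeps the label column a uniform integer type. Nodes carrying it should still be left out of the custom split files so they are never treated as labeled examples.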