How to include node type with both labeled and unlabeled members

Open robertreaney opened this issue 5 months ago • 1 comments

I have a use case for a single node type that has a mixture of labeled and unlabeled members. The docs sound like this is handled:

source

"These JSON files only need to list the IDs on its own set. For example, in a node classification task, there are 100 nodes and node ID starts from 0, and assume the last 50 nodes (ID from 49 to 99) have labels associated. ...."

I have parquet files for my nodes. During the graph construction routine, the creation of the final numpy arrays for labels seemed to indicate that I needed dummy labels for these nodes. How am I supposed to handle this use case where I want to keep to a single node type for this data?

Thank you!

robertreaney, Aug 13 '25 18:08

Hi @robertreaney is this a correct description of your data?

my_node_type.parquet

node_id,label
a,0
b,1
c,NaN

So some of your nodes have labels while others don't, and you want to include only the labeled nodes (a and b in this case) in your train/validation/test sets?

You could do this by creating custom split files, e.g. train/validation/test_ids.csv, whose contents list only labeled nodes:

node_id
a
b

So for nodes that do not have labels, you would simply not list their ids in the custom split files you create.
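A minimal sketch of that filtering step, using only the Python standard library. It assumes the node table is available as CSV text with `node_id` and `label` columns, as in the example above; the helper names are hypothetical and not part of GraphStorm. How the labeled IDs are then divided among the train, validation, and test files is left out.

```python
import csv
import io
import math

# Hypothetical node table mirroring the example above; NaN marks a missing label.
NODE_CSV = "node_id,label\na,0\nb,1\nc,NaN\n"


def labeled_node_ids(node_csv_text):
    """Return the IDs of nodes whose label is present and not NaN."""
    ids = []
    for row in csv.DictReader(io.StringIO(node_csv_text)):
        raw = (row["label"] or "").strip()
        # Skip empty cells and cells that parse to NaN.
        if raw == "" or math.isnan(float(raw)):
            continue
        ids.append(row["node_id"])
    return ids


def split_file_text(ids):
    """Render a one-column CSV (header: node_id) listing the given IDs."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["node_id"])
    for node_id in ids:
        writer.writerow([node_id])
    return out.getvalue()


split_text = split_file_text(labeled_node_ids(NODE_CSV))
print(split_text)
```

Node `c` never reaches the split file, so it would not be used for training or evaluation.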

thvasilo, Aug 27 '25 00:08

@thvasilo this is correct. For clarity before closing this ticket: unlabeled nodes must still have a label value of a compatible type. The correct way to provide unlabeled nodes to the gconstruct process is with a dummy label. I have used -1 in my codebase.

node_id,label
a,0
b,1
c,-1
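A rough sketch of how the dummy value might be filled in before graph construction, using only the Python standard library. It assumes the raw node table is CSV text with `node_id` and `label` columns and that labels are integer class IDs; the function name is hypothetical and not a GraphStorm API. It also assumes the dummy-labeled nodes are still left out of the custom split files, as suggested earlier in the thread.

```python
import csv
import io
import math

DUMMY_LABEL = -1  # placeholder value used in this thread; any value of the label's type is assumed to work

# Hypothetical raw node table in which node c has no label.
RAW_CSV = "node_id,label\na,0\nb,1\nc,NaN\n"


def fill_dummy_labels(node_csv_text, dummy=DUMMY_LABEL):
    """Return (node_id, label) pairs with missing labels replaced by `dummy`."""
    rows = []
    for row in csv.DictReader(io.StringIO(node_csv_text)):
        raw = (row["label"] or "").strip()
        missing = raw == "" or math.isnan(float(raw))
        # Real labels are cast to int so the whole column has a single type.
        label = dummy if missing else int(float(raw))
        rows.append((row["node_id"], label))
    return rows


filled = fill_dummy_labels(RAW_CSV)

# Render the filled table back to CSV text.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["node_id", "label"])
writer.writerows(filled)
print(out.getvalue())
```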

robertreaney, Dec 02 '25 14:12