Doubt about featureless dataset

Open llan-ml opened this issue 5 years ago • 5 comments

Hi Thomas,

Thanks for sharing your code. I want to use GCN for datasets without features. From previous issues, you mentioned that one option is to use embeddings learned by other unsupervised methods as features.

I gave it a try on the BlogCatalog data and used DeepWalk to learn node embeddings. The original BlogCatalog is a multi-label dataset, so for simplicity I randomly removed some labels from each node to turn it into a multi-class (single-label) dataset.
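
Roughly, the embedding step looks like this (a simplified sketch rather than exactly what I ran; the walk parameters and gensim version are assumptions):

    import random

    import numpy as np
    import networkx as nx
    import scipy.sparse as sp
    from gensim.models import Word2Vec  # assuming gensim 3.x, where the embedding size argument is `size`

    def random_walks(G, num_walks=10, walk_length=40):
        """Uniform random walks over the graph; each walk is a list of node-id strings."""
        walks = []
        nodes = list(G.nodes())
        for _ in range(num_walks):
            random.shuffle(nodes)
            for start in nodes:
                walk = [start]
                while len(walk) < walk_length:
                    neighbors = list(G.neighbors(walk[-1]))
                    if not neighbors:
                        break
                    walk.append(random.choice(neighbors))
                walks.append([str(n) for n in walk])
        return walks

    G = nx.read_gpickle("./data/BlogCatalog_nx.pkl")
    walks = random_walks(G)
    model = Word2Vec(walks, size=128, window=10, min_count=0, sg=1, workers=4)  # skip-gram, as in DeepWalk
    embeddings = np.vstack([model.wv[str(n)] for n in G.nodes()])  # row order matches G.nodes()
    features = sp.csr_matrix(embeddings)  # the gcn training script takes the feature matrix in scipy sparse form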

Although I tried tuning various hyperparameters, including the number of hidden units, the number of layers, activations, dropout, and weight decay, the training accuracy remains very poor. Here are some training logs:

Epoch: 0001 train_loss= 3.71325 train_acc= 0.06095 val_loss= 3.56659 val_acc= 0.11600 time= 0.95786
Epoch: 0002 train_loss= 3.59333 train_acc= 0.12311 val_loss= 3.61276 val_acc= 0.11600 time= 0.82414
Epoch: 0003 train_loss= 3.81428 train_acc= 0.12311 val_loss= 3.42935 val_acc= 0.04400 time= 0.83664
Epoch: 0004 train_loss= 3.48834 train_acc= 0.06166 val_loss= 3.51022 val_acc= 0.06200 time= 0.85404
Epoch: 0005 train_loss= 3.52989 train_acc= 0.07022 val_loss= 3.53583 val_acc= 0.06200 time= 0.81449
Epoch: 0006 train_loss= 3.54849 train_acc= 0.07063 val_loss= 3.51908 val_acc= 0.06200 time= 0.83169
Epoch: 0007 train_loss= 3.53239 train_acc= 0.07073 val_loss= 3.47710 val_acc= 0.06200 time= 0.82780
Epoch: 0008 train_loss= 3.50293 train_acc= 0.07063 val_loss= 3.44566 val_acc= 0.06200 time= 0.83416
Epoch: 0009 train_loss= 3.49422 train_acc= 0.07073 val_loss= 3.42412 val_acc= 0.06200 time= 0.80358
Epoch: 0010 train_loss= 3.47205 train_acc= 0.06900 val_loss= 3.41098 val_acc= 0.06200 time= 0.83321
Epoch: 0011 train_loss= 3.44815 train_acc= 0.06370 val_loss= 3.40743 val_acc= 0.08200 time= 0.83813
Epoch: 0012 train_loss= 3.43334 train_acc= 0.06217 val_loss= 3.40451 val_acc= 0.08200 time= 0.84626
Epoch: 0013 train_loss= 3.42193 train_acc= 0.09234 val_loss= 3.40040 val_acc= 0.11600 time= 0.82693
Epoch: 0014 train_loss= 3.41513 train_acc= 0.12311 val_loss= 3.40461 val_acc= 0.11600 time= 0.85945
Epoch: 0015 train_loss= 3.41538 train_acc= 0.12311 val_loss= 3.40511 val_acc= 0.11600 time= 0.80841
Epoch: 0016 train_loss= 3.41279 train_acc= 0.12311 val_loss= 3.40204 val_acc= 0.11600 time= 0.82393
Epoch: 0017 train_loss= 3.41035 train_acc= 0.12322 val_loss= 3.38996 val_acc= 0.11600 time= 0.84268
Epoch: 0018 train_loss= 3.40867 train_acc= 0.12311 val_loss= 3.37526 val_acc= 0.11600 time= 0.80071
Epoch: 0019 train_loss= 3.40945 train_acc= 0.12301 val_loss= 3.37921 val_acc= 0.11600 time= 0.83691
Epoch: 0020 train_loss= 3.40678 train_acc= 0.12291 val_loss= 3.36893 val_acc= 0.11600 time= 0.82570
Epoch: 0021 train_loss= 3.40442 train_acc= 0.12240 val_loss= 3.34923 val_acc= 0.11600 time= 0.84218
Epoch: 0022 train_loss= 3.41000 train_acc= 0.12281 val_loss= 3.36223 val_acc= 0.11600 time= 0.84202
Epoch: 0023 train_loss= 3.40565 train_acc= 0.12281 val_loss= 3.36530 val_acc= 0.11600 time= 0.87319
Epoch: 0024 train_loss= 3.40541 train_acc= 0.12281 val_loss= 3.34429 val_acc= 0.11600 time= 0.83222
Epoch: 0025 train_loss= 3.40752 train_acc= 0.12240 val_loss= 3.35407 val_acc= 0.11600 time= 0.84253
Epoch: 0026 train_loss= 3.40010 train_acc= 0.12261 val_loss= 3.36364 val_acc= 0.11600 time= 0.83736
Epoch: 0027 train_loss= 3.40163 train_acc= 0.12352 val_loss= 3.35432 val_acc= 0.11600 time= 0.83250
Epoch: 0028 train_loss= 3.40048 train_acc= 0.12271 val_loss= 3.36024 val_acc= 0.11600 time= 0.82765
Epoch: 0029 train_loss= 3.39748 train_acc= 0.12311 val_loss= 3.36625 val_acc= 0.11600 time= 0.85199
Epoch: 0030 train_loss= 3.39877 train_acc= 0.12311 val_loss= 3.36346 val_acc= 0.11600 time= 0.83927
Epoch: 0031 train_loss= 3.39693 train_acc= 0.12311 val_loss= 3.36568 val_acc= 0.11600 time= 0.83743
Epoch: 0032 train_loss= 3.39625 train_acc= 0.12332 val_loss= 3.36298 val_acc= 0.11600 time= 0.82265
Epoch: 0033 train_loss= 3.39562 train_acc= 0.12271 val_loss= 3.37086 val_acc= 0.11600 time= 0.85519
Epoch: 0034 train_loss= 3.39694 train_acc= 0.12311 val_loss= 3.35922 val_acc= 0.11600 time= 0.84654
Epoch: 0035 train_loss= 3.39806 train_acc= 0.12301 val_loss= 3.37311 val_acc= 0.11600 time= 0.83603

Have you encountered similar problems? I would be grateful if you could provide some advice.

llan-ml avatar Mar 29 '19 10:03 llan-ml

Maybe your label processing scheme introduced some issues? You can use multi-label targets by replacing the loss function accordingly.

tkipf avatar Mar 29 '19 11:03 tkipf

The label processing is quite simple, as follows:

    import numpy as np
    import networkx as nx

    G = nx.read_gpickle("./data/BlogCatalog_nx.pkl")
    adj = nx.adjacency_matrix(G)
    labels = G.graph["label_array"]  # dense 0/1 numpy array of shape [num_nodes, num_classes]
    for i in range(len(labels)):
        num_labels = labels[i].sum()
        if num_labels <= 1:
            continue
        # Keep exactly one label per node: randomly zero out all but one of its labels.
        idx = np.random.choice(labels[i].nonzero()[0], size=num_labels - 1, replace=False)
        labels[i][idx] = 0

As a sanity check, I used the graph G and the processed label_array, together with the embeddings learned by DeepWalk, to train a logistic regression classifier, and the results look normal. So the data processing itself should not be at fault.
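
For concreteness, the sanity check was roughly the following (the embedding file name is a placeholder and the split is only illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    embeddings = np.load("deepwalk_embeddings.npy")  # placeholder path, shape [num_nodes, dim]
    y = labels.argmax(axis=1)                        # one class per node after the label pruning above

    X_train, X_test, y_train, y_test = train_test_split(embeddings, y, test_size=0.5, random_state=0)
    clf = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=1000)
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))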

I also tried the multi-label setting, and replaced the loss function with:

import tensorflow as tf

def masked_sigmoid_cross_entropy(preds, labels, mask):
    """Element-wise sigmoid cross-entropy, averaged over the masked (training) nodes only."""
    loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=preds, labels=labels)
    mask = tf.cast(mask, dtype=tf.bool)
    masked_loss = tf.boolean_mask(loss, mask)       # keep rows belonging to training nodes
    masked_labels = tf.boolean_mask(labels, mask)
    weights = masked_labels * 10 + 1                # weight positive entries 11x, negatives 1x
    return tf.reduce_mean(weights * masked_loss)

Since the label array is quite sparse, the weights are there to keep the model from cheating by predicting all zeros for every node and label; using these weights increases the best performance from 16% to 20%.
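
For reference, the same positive weighting can also be expressed with TensorFlow's built-in weighted op (assuming TF 1.x; for 0/1 targets, pos_weight=11 is equivalent to the labels * 10 + 1 weights above):

    import tensorflow as tf

    def masked_weighted_sigmoid_cross_entropy(preds, labels, mask, pos_weight=11.0):
        # Scale the loss on positive entries by pos_weight, leaving negative entries at weight 1.
        loss = tf.nn.weighted_cross_entropy_with_logits(targets=labels, logits=preds,
                                                        pos_weight=pos_weight)
        masked_loss = tf.boolean_mask(loss, tf.cast(mask, dtype=tf.bool))
        return tf.reduce_mean(masked_loss)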

BTW, have you ever successfully applied GCN to featureless graph data? Thanks.

llan-ml avatar Mar 29 '19 11:03 llan-ml

I’ve trained GCN models with features derived from graph structure (e.g. node degree or DeepWalk embeddings) before, and this generally worked quite well. If your GCN model is performing worse than a logistic regression classifier, then maybe try some of the other propagation models (e.g. full first-order) described in our paper. These should perform at least as well as a logistic regression classifier if trained correctly.
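
For reference, the full first-order model uses separate weights for the self term and the neighbor term, Z = X W_0 + D^{-1/2} A D^{-1/2} X W_1. A rough numpy sketch of one such layer (just an illustration of the formula from the paper, not code from this repository):

    import numpy as np
    import scipy.sparse as sp

    def normalize_adj(adj):
        """Symmetric normalization D^{-1/2} A D^{-1/2} of a scipy sparse adjacency matrix."""
        deg = np.asarray(adj.sum(axis=1)).flatten().astype(np.float64)
        d_inv_sqrt = np.zeros_like(deg)
        d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
        return (sp.diags(d_inv_sqrt) @ adj @ sp.diags(d_inv_sqrt)).tocsr()

    def full_first_order_layer(X, adj_norm, W0, W1):
        """One layer: a self term X @ W0 plus a separately parameterized neighbor term."""
        return X @ W0 + adj_norm @ (X @ W1)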

tkipf avatar Mar 29 '19 11:03 tkipf

I tried the Cora dataset. The GCN model with DeepWalk embeddings as input converges quickly, and the results look quite good. However, the same code on BlogCatalog (with some hyperparameter tuning) yields poor performance. One possible issue is that the label space in BlogCatalog has a higher dimension (39 vs. 3) and is sparser.

llan-ml avatar Mar 29 '19 12:03 llan-ml

Thanks for sharing. Could you explain how to use VGAE without features? @llan-ml

cquzys avatar Dec 05 '19 06:12 cquzys