
The training set accuracy is much lower than the test set accuracy.

yichuan-w opened this issue 2 years ago · 31 comments

Why is the training node accuracy for IGB much lower than the validation and test accuracy? Usually, based on OGB experience, dropout might cause a drop of around 1%, but here there's a difference of about 10%, which is quite peculiar. Is there any specific algorithm or method used by IGB when splitting the train, validation, and test sets that could lead to this situation? [screenshot: accuracy results]

yichuan-w avatar Sep 27 '23 16:09 yichuan-w

Could you please let me know which dataset size this is, which model you're using, and what the accuracy numbers are?

As for your question, the splits are random, but based on our baseline tests the accuracies shouldn't be that different.

akhatua2 avatar Sep 27 '23 17:09 akhatua2

I just used the model at https://github.com/IllinoisGraphBenchmark/IGB-Datasets/blob/main/igb/train_single_gpu.py with the tiny dataset.

yichuan-w avatar Sep 27 '23 18:09 yichuan-w

Sorry about that, here is the accuracy result: [screenshot: accuracy results]

yichuan-w avatar Sep 27 '23 18:09 yichuan-w

So do you have any idea about this abnormal phenomenon? It seems there may be some problem with the distribution of the train nodes, or perhaps a bug in the dataset?

yichuan-w avatar Sep 27 '23 20:09 yichuan-w

I will look into it soon and confirm why this is happening and what we were doing differently. Please let me know if you see this happening with IGB-small and the other models.

akhatua2 avatar Sep 27 '23 20:09 akhatua2

As I tested, small behaves the same. I hope it's not due to some other parameters I've changed; I only adjusted some model parameters. It would be best if you could run a check on tiny to double-confirm.

yichuan-w avatar Sep 27 '23 21:09 yichuan-w

Until I get a chance to look into it (I don't have the setup with me right now), feel free to shuffle the dataset and continue your experiments.

akhatua2 avatar Sep 27 '23 21:09 akhatua2

Cool, I will try that.

yichuan-w avatar Sep 27 '23 22:09 yichuan-w

How is this problem going? It seems that the issue still exists.

yichuan-w avatar Oct 01 '23 00:10 yichuan-w

[screenshot: Screen Shot 2023-10-01 at 10 48 23 AM, training run results]

Hello Yichuan

I ran this model using the default setup and I wasn't able to reproduce those numbers. I also looked at the distribution of labels in the train/val/test splits, but I don't see any difference:

[plots: label distribution of the train, val, and test splits]

Here y is the number of nodes (in the case of the train set I divided by 3 to normalize) and x is the label.
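
(As a rough sketch of how such a per-split label comparison can be computed, assuming a DGLGraph g with per-node 'label' plus boolean 'train_mask'/'val_mask'/'test_mask' entries in g.ndata, as in the split-mask snippet later in this thread:)

    # Sketch: per-split label counts for comparing train/val/test label distributions.
    import torch

    labels = g.ndata['label'].long()
    num_classes = int(labels.max()) + 1
    for split in ('train_mask', 'val_mask', 'test_mask'):
        mask = g.ndata[split].bool()
        counts = torch.bincount(labels[mask], minlength=num_classes)
        print(split, counts.tolist())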

I'll take a look at the other models and dataset sizes and edge density distribution and let you know if I can explain why you are getting those results.

akhatua2 avatar Oct 01 '23 18:10 akhatua2

Thank you for your response. Can you tell me what the x-axis and y-axis represent in this context?

yichuan-w avatar Oct 01 '23 18:10 yichuan-w

Updated previous comment. Closing this issue for now. Will set a backlog to explore this. You could potentially plot out your train/test/val label distribution to confirm whether it matches the plots I linked above.

akhatua2 avatar Oct 01 '23 18:10 akhatua2

I'm sorry for the inconvenience. Can you share your code with me? I used the code from this link, but I still can't reproduce your results. I just want to confirm whether the code you used is the same as mine, or whether there are some differences, such as model size or GPU usage in MB.

yichuan-w avatar Oct 01 '23 21:10 yichuan-w

Ahh, I see the code in the repo only shows the train acc of the last batch. Maybe try using this in the train loop? (Note that train_acc is a list here.)

    for epoch in tqdm.tqdm(range(args.epochs)):
        # Loop over the dataloader to sample the computation dependency graph as a list of
        # blocks.
        epoch_loss = 0
        gpu_mem_alloc = 0
        epoch_start = time.time()
        model.train()
        train_acc = []
        for step, (input_nodes, seeds, blocks) in enumerate(train_dataloader):
            blocks = [block.int().to(device) for block in blocks]
            batch_inputs = blocks[0].srcdata['feat']
            batch_labels = blocks[-1].dstdata['label']

            batch_pred = model(blocks, batch_inputs)
            loss = loss_fcn(batch_pred, batch_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            epoch_loss += loss.detach()
            train_acc.append(sklearn.metrics.accuracy_score(batch_labels.cpu().numpy(), 
                batch_pred.argmax(1).detach().cpu().numpy())*100)

            gpu_mem_alloc = (
                torch.cuda.max_memory_allocated() / 1000000
                if torch.cuda.is_available()
                else 0
            )
        if epoch%args.log_every == 0:
            model.eval()
            predictions = []
            labels = []
            with torch.no_grad():
                for _, _, blocks in val_dataloader:
                    blocks = [block.to(device) for block in blocks]
                    inputs = blocks[0].srcdata['feat']
                    labels.append(blocks[-1].dstdata['label'].cpu().numpy())
                    predictions.append(model(blocks, inputs).argmax(1).cpu().numpy())
                predictions = np.concatenate(predictions)
                labels = np.concatenate(labels)
                val_acc = sklearn.metrics.accuracy_score(labels, predictions)*100
                if best_accuracy < val_acc:
                    best_accuracy = val_acc
                    if args.model_save:
                        torch.save(model.state_dict(), args.modelpath)

            tqdm.tqdm.write(
                "Epoch {:03d} | Loss {:.4f} | Train Acc {:.2f} | Val Acc {:.2f} | Time {} | GPU {:.1f} MB".format(
                    epoch,
                    epoch_loss,
                    np.mean(train_acc),
                    val_acc,
                    str(datetime.timedelta(seconds = int(time.time() - epoch_start))),
                    gpu_mem_alloc
                )
            )

That's the only thing I can see that is different. The screenshot above shows results trained on an AMD EPYC 7R32 CPU. Hope that helps.

akhatua2 avatar Oct 01 '23 21:10 akhatua2

Hmm, I made changes to this part before, but it still doesn't seem right. Can you share the entire code?

yichuan-w avatar Oct 01 '23 21:10 yichuan-w

The result is something like this: [screenshot: training log]

yichuan-w avatar Oct 01 '23 21:10 yichuan-w

I ran download.py to download the dataset.

yichuan-w avatar Oct 01 '23 21:10 yichuan-w

So maybe there is some bug in the code in the repo?

yichuan-w avatar Oct 01 '23 21:10 yichuan-w

train_single_gpu.py.zip

Here is the file I'm using. Feel free to use this script. Unfortunately, given that the label distribution is similar across the splits, I cannot think of a reason why you are getting these results. That said, please let me know if you can replicate this discrepancy on the larger datasets like large or full.

akhatua2 avatar Oct 01 '23 21:10 akhatua2

It's still quite puzzling. I can't reproduce your results. To confirm: there shouldn't be any issues if I directly run download.py to download tiny and then execute IGB-Datasets/igb/train_single_gpu.py, right? But my results on both tiny and small lead to the same conclusion: train accuracy is always about 10 points lower than val accuracy. Also, your results shown before don't seem to align well with what's mentioned in the paper [screenshot: results table from the paper], and it looks like my results are more reasonable with respect to the val acc and test acc.

yichuan-w avatar Oct 01 '23 22:10 yichuan-w

Hey Yichuan,

I pulled the code from the repo and ran it, and I am able to reproduce your results. It seems like there is indeed a ~10% difference in the train acc. The repo code seems to be functionally correct. Maybe reshuffling the dataset (assigning random splits) will cause the results to be more uniform.

In traditional datasets, train acc is expected to be higher than or similar to val and test acc as long as the labels aren't extremely skewed between the train/val/test splits. However, in our case the splits have a similar label distribution.

This makes me suspect that the density of edges in these graph splits plays an important role. I just ran the numbers, and it looks like tiny, small, and medium, due to their sizes, are more vulnerable to this skew in edge density.

For the tiny dataset (counting only edges strictly within a split, i.e., both the src node and dst node are in the same split):

  • train split has an average of 3.6 edges per node
  • val split has an average of 7.29 edges per node
  • test split has an average of 9.04 edges per node.

I believe this skew in edge density is causing the low train acc for graph models. This is an interesting property that seems to be unique to graph datasets.
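
(For reference, the within-split averages above can be reproduced with a short check along these lines; this is only a sketch, assuming a DGLGraph g with boolean split masks stored in g.ndata:)

    # Sketch: average number of strictly-within-split edges per node for each split.
    def within_split_avg_edges(g, mask_name):
        mask = g.ndata[mask_name].bool()
        src, dst = g.edges()                       # endpoints of every edge
        both_in_split = mask[src] & mask[dst]      # edge lies strictly inside the split
        return both_in_split.sum().item() / mask.sum().item()

    for name in ('train_mask', 'val_mask', 'test_mask'):
        print(name, within_split_avg_edges(g, name))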

We will examine the impact of this on the train acc and build more intelligent train/val/test splits that take edge density information into account. Thanks for bringing this up, which allowed us to highlight this interesting phenomenon. We will keep you posted on updated splits and add this to the README.

akhatua2 avatar Oct 01 '23 22:10 akhatua2

Yes, this is indeed an intriguing phenomenon. But another point: if you modify the val_dataloader as

    val_dataloader = dgl.dataloading.DataLoader(
        g, train_nid, sampler,
        batch_size=args.batch_size,
        shuffle=False,
        drop_last=False,
        num_workers=args.num_workers)

the results remain consistent (val acc is still about 10% higher than train acc). This makes me wonder if there's some functional error, and that is why I do not agree with your point about edge density. Furthermore, I have two questions:

1. Why were your previous results correct?
2. In your experiments, does this phenomenon only appear after shuffling once, and then it no longer occurs?

yichuan-w avatar Oct 01 '23 23:10 yichuan-w

Also, what do you mean by this: reshuffling the dataset (assigning random splits) causes the results to be more uniform? What specific operations did you do?

yichuan-w avatar Oct 02 '23 00:10 yichuan-w


            n_nodes = node_features.shape[0]
            n_train = int(n_nodes * self.args.train_percent)
            n_val   = int(n_nodes * self.args.val_percent)
            n_test  = int(n_nodes * self.args.test_percent)
            # random indices for the train/val/test nodes
            train_idx = np.random.choice(n_nodes, n_train, replace=False)
            val_idx = np.random.choice(n_nodes, n_val, replace=False)
            test_idx = np.random.choice(n_nodes, n_test, replace=False)
            
            train_mask = torch.zeros(n_nodes, dtype=torch.bool)
            val_mask = torch.zeros(n_nodes, dtype=torch.bool)
            test_mask = torch.zeros(n_nodes, dtype=torch.bool)
            
            train_mask[train_idx] = True
            val_mask[val_idx] = True
            test_mask[test_idx] = True
            
            self.graph.ndata['train_mask'] = train_mask
            self.graph.ndata['val_mask'] = val_mask
            self.graph.ndata['test_mask'] = test_mask

I changed it to something like that, but the result is still the same (still about a 10% gap between train acc and val acc).
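
(Side note: because the three np.random.choice calls above draw indices independently, the resulting splits can overlap. A non-overlapping variant, sketched below with the same variable names, slices a single permutation instead:)

    # Sketch: non-overlapping random split by slicing one permutation of the node ids.
    perm = np.random.permutation(n_nodes)
    train_idx = perm[:n_train]
    val_idx   = perm[n_train:n_train + n_val]
    test_idx  = perm[n_train + n_val:n_train + n_val + n_test]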

yichuan-w avatar Oct 02 '23 00:10 yichuan-w

> Yes, this is indeed an intriguing phenomenon. But another point: if you modify the val_dataloader as val_dataloader = dgl.dataloading.DataLoader(g, train_nid, sampler, batch_size=args.batch_size, shuffle=False, drop_last=False, num_workers=args.num_workers), the results remain consistent (val acc is still about 10% higher than train acc). This makes me wonder if there's some functional error, and that is why I do not agree with your point about edge density. Furthermore, I have two questions:

Hey, I was able to reproduce your results. Yeah, I'm leaning towards it potentially being a functional error. I will need to debug it. Please let me know if you find anything that stands out, since I don't see anything obvious.

> Also, what do you mean by this: reshuffling the dataset (assigning random splits) causes the results to be more uniform? What specific operations did you do?

This was a hypothesis, as I assumed it could be an edge-density-related issue. It might not be relevant. Maybe you can experiment by using a DGL train function and just use the dataloader for IGB.

akhatua2 avatar Oct 02 '23 03:10 akhatua2

So, may I ask how you produced the results in the previous chart? Or did that code have a bug? [screenshot: previously shared results]

yichuan-w avatar Oct 02 '23 04:10 yichuan-w

That had an issue with my local dataloader.

akhatua2 avatar Oct 02 '23 04:10 akhatua2

OK, I see.

yichuan-w avatar Oct 02 '23 04:10 yichuan-w

> Yeah, I'm leaning towards it potentially being a functional error. I will need to debug it.

Agreed.

yichuan-w avatar Oct 02 '23 04:10 yichuan-w

BTW, we were informed that there is some issue with the DGL sampler. Please check out this pull request: Creating a DGL sampler that matches PyG/GLT sampling result #36.

It is possible that this issue is related.

akhatua2 avatar Oct 02 '23 04:10 akhatua2