The training set accuracy is much lower than the test set accuracy.
BTW, why is the training node accuracy for IGB much lower than the validation and test accuracy? Usually, based on OGB experience, there might be around a 1% drop due to dropout, but here there's a difference of 10%, which is quite peculiar. Is there any specific algorithm or method used by IGB when splitting the train, validation, and test datasets that could lead to this situation?
Could you please let me know what dataset size this is, what model you're using, and what the accuracy numbers are?
As for your question, the splits are random, but based on our baseline tests the accuracies shouldn't be that different.
I just used the model here, https://github.com/IllinoisGraphBenchmark/IGB-Datasets/blob/main/igb/train_single_gpu.py, with the tiny dataset.
Sorry about that, this is the accuracy result:
So do you have any idea about this abnormal phenomenon? It seems there may be some problem with the distribution of the train nodes, or some bug in the dataset?
I will look into it soon and confirm why this is happening and what we were doing differently. Please let me know if you see this happening with IGB-small and the other models.
As I tested, small behaves the same. I hope it's not due to some other parameters I've changed; I only adjusted some model parameters. It would be best if you could run tiny to double-check.
Until I get a chance to look into it (I don't have the setup with me right now), feel free to shuffle the dataset and continue your experiments.
Cool, I will try that.
Any updates on this? It seems that this issue still exists.
Hello Yichuan
I ran this model using the default setup and I wasn't able to reproduce those numbers. I also looked at the distribution of labels in the train/val/test splits, but I don't see any difference:
Here y is the number of nodes (in the case of the train set I divided by 3 to normalize) and x is the label.
I'll take a look at the other models, dataset sizes, and the edge density distribution, and let you know if I can explain why you are getting those results.
Thank you for your response. Can you tell me what the x-axis and y-axis represent in this context?
Updated the previous comment. Closing this issue for now. Will set a backlog item to explore this. You could potentially plot out your train/test/val label distribution to confirm whether it matches the plots I linked above.
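For reference, here is a minimal sketch of how such a label-distribution plot could be produced. It assumes the IGB graph is already loaded as a DGL graph g with 'label', 'train_mask', 'val_mask', and 'test_mask' in g.ndata; these names are assumptions based on the thread, not a confirmed API, and this is not the exact script used for the plots above.

import matplotlib.pyplot as plt
import numpy as np

# Sketch: count how many nodes carry each label within each split and plot the curves.
labels = g.ndata['label'].cpu().numpy()
for name in ['train_mask', 'val_mask', 'test_mask']:
    split_labels = labels[g.ndata[name].bool().cpu().numpy()]
    counts = np.bincount(split_labels)          # counts[k] = number of nodes with label k
    plt.plot(np.arange(len(counts)), counts, label=name)
plt.xlabel('label')
plt.ylabel('number of nodes')
plt.legend()
plt.show()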
I'm sorry for the inconvenience. Can you share your code with me? I used the code from this link, but I still can't reproduce your results. I just want to confirm whether the code you used is the same as mine. Could there be some differences in your code, such as the model size or the GPU memory usage in MB?
Ahh, I see, the code in the repo only shows the train acc of the last batch. Maybe try using this in the train loop? (Notice that train_acc is a list here.)
for epoch in tqdm.tqdm(range(args.epochs)):
    # Loop over the dataloader to sample the computation dependency graph as a list of
    # blocks.
    epoch_loss = 0
    gpu_mem_alloc = 0
    epoch_start = time.time()
    model.train()
    train_acc = []
    for step, (input_nodes, seeds, blocks) in enumerate(train_dataloader):
        blocks = [block.int().to(device) for block in blocks]
        batch_inputs = blocks[0].srcdata['feat']
        batch_labels = blocks[-1].dstdata['label']
        batch_pred = model(blocks, batch_inputs)
        loss = loss_fcn(batch_pred, batch_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.detach()
        train_acc.append(sklearn.metrics.accuracy_score(
            batch_labels.cpu().numpy(),
            batch_pred.argmax(1).detach().cpu().numpy()) * 100)
    gpu_mem_alloc = (
        torch.cuda.max_memory_allocated() / 1000000
        if torch.cuda.is_available()
        else 0
    )
    if epoch % args.log_every == 0:
        model.eval()
        predictions = []
        labels = []
        with torch.no_grad():
            for _, _, blocks in val_dataloader:
                blocks = [block.to(device) for block in blocks]
                inputs = blocks[0].srcdata['feat']
                labels.append(blocks[-1].dstdata['label'].cpu().numpy())
                predictions.append(model(blocks, inputs).argmax(1).cpu().numpy())
            predictions = np.concatenate(predictions)
            labels = np.concatenate(labels)
            val_acc = sklearn.metrics.accuracy_score(labels, predictions) * 100
        if best_accuracy < val_acc:
            best_accuracy = val_acc
            if args.model_save:
                torch.save(model.state_dict(), args.modelpath)
        tqdm.tqdm.write(
            "Epoch {:03d} | Loss {:.4f} | Train Acc {:.2f} | Val Acc {:.2f} | Time {} | GPU {:.1f} MB".format(
                epoch,
                epoch_loss,
                np.mean(train_acc),
                val_acc,
                str(datetime.timedelta(seconds=int(time.time() - epoch_start))),
                gpu_mem_alloc
            )
        )
That's the only thing I can see that is different. The above screenshot shows results trained on an AMD EPYC 7R32 CPU. Hope that helps.
Hmm, I made changes to this part before, but it still doesn't seem right. Can you share the entire code?
The result is something like this:
I ran download.py to download the dataset.
So maybe there is some bug in the code in the repo?
Here is the file I'm using; feel free to use this script. Unfortunately, given that the label distribution is similar across the splits, I cannot think of a reason why you are getting these results. That said, please let me know if you can replicate this discrepancy in the larger datasets like large or full.
It's still quite puzzling. I can't reproduce your results. To confirm, there shouldn't be any issues if I directly run download.py to download 'tiny' and then execute IGB-Datasets/igb/train_single_gpu.py, right? But my results on both 'tiny' and 'small' lead to the same conclusion: the train accuracy is always about 10 points lower than the val accuracy. Also, your results shown before don't seem to align well with what's mentioned in the paper, and it looks like my results are more reasonable as far as the val acc and test acc go?
Hey Yichuan,
I pulled the code from the repo and ran it, and I am able to reproduce your results. It seems like there is indeed a ~10% difference in the train acc. The repo code seems to be functionally correct. Maybe reshuffling the dataset (assigning random splits) will cause the results to be more uniform.
In traditional datasets, the train acc is expected to be higher than or similar to the val and test acc, as long as the labels aren't extremely skewed between the train/val/test splits. However, in our case the splits have a similar label distribution.
This makes me suspect that the density of the edges in these graph splits plays an important role. I just ran the numbers, and it looks like tiny, small, and medium, due to their sizes, are more vulnerable to this skew in edge density.
For the tiny dataset (counting only edges strictly within a split, i.e., both the src node and the dst node are in the same split):
- train split has an average of 3.6 edges per node
- val split has an average of 7.29 edges per node
- test split has an average of 9.04 edges per node.
I believe this skew in edge density is causing the low train acc for graph models. This is an interesting property that seems to be unique to graph datasets.
We will examine the impact of this on the train acc and build more intelligent train/val/test splits that take edge density information into account. Thanks for bringing this up, which allowed us to highlight this interesting phenomenon. We will keep you posted on updated splits and add this to the readme.
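In case it helps, here is a rough sketch of how this within-split edge density can be computed. It assumes the graph is loaded as a DGL graph with the usual train_mask/val_mask/test_mask node data; this is an illustration, not the exact script used for the numbers above.

import torch

def split_edge_density(g, mask_name):
    # Count edges whose src and dst both fall inside the given split, then
    # divide by the number of nodes in that split.
    mask = g.ndata[mask_name].bool()
    src, dst = g.edges()
    within = mask[src] & mask[dst]          # edges strictly within the split
    n_nodes = int(mask.sum())
    return within.sum().item() / max(n_nodes, 1)

# Example usage on an IGB graph loaded as a DGL graph `g`:
# for name in ['train_mask', 'val_mask', 'test_mask']:
#     print(name, split_edge_density(g, name))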
Yes, this is indeed an intriguing phenomenon. But another point: if you modify val_dataloader as

val_dataloader = dgl.dataloading.DataLoader(
    g, train_nid, sampler,
    batch_size=args.batch_size,
    shuffle=False,
    drop_last=False,
    num_workers=args.num_workers)

(i.e., run the "validation" pass over the train nodes), the results remain the same: the reported val acc is still about 10% higher than the train acc. This makes me wonder if there's some functional error, and that is why I don't agree with your point about edge density. Furthermore, I have two questions:
Why were your previous results correct? In your experiments, does this phenomenon only go away after the dataset has been shuffled once?
Also, what do you mean by "reshuffling the dataset (assigning random splits) will cause the results to be more uniform"? What specific operations did you do?
n_nodes = node_features.shape[0]
n_train = int(n_nodes * self.args.train_percent)
n_val = int(n_nodes * self.args.val_percent)
n_test = int(n_nodes * self.args.test_percent)
## random indices for the train/val/test node splits
train_idx = np.random.choice(n_nodes, n_train, replace=False)
val_idx = np.random.choice(n_nodes, n_val, replace=False)
test_idx = np.random.choice(n_nodes, n_test, replace=False)
train_mask = torch.zeros(n_nodes, dtype=torch.bool)
val_mask = torch.zeros(n_nodes, dtype=torch.bool)
test_mask = torch.zeros(n_nodes, dtype=torch.bool)
train_mask[train_idx] = True
val_mask[val_idx] = True
test_mask[test_idx] = True
self.graph.ndata['train_mask'] = train_mask
self.graph.ndata['val_mask'] = val_mask
self.graph.ndata['test_mask'] = test_mask
I changed it to something like that, but the result is still the same (the val acc is still about 10% higher than the train acc).
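One thing worth noting about the snippet above is that the three independent np.random.choice calls can produce overlapping train/val/test sets. A permutation-based split avoids that; here is a minimal sketch (the split fractions are placeholders, not the repo's values, and node_features is the same array as in the snippet above):

import numpy as np
import torch

# Sketch: carve non-overlapping train/val/test splits out of one permutation.
n_nodes = node_features.shape[0]
perm = np.random.permutation(n_nodes)
n_train = int(n_nodes * 0.6)
n_val = int(n_nodes * 0.2)

train_idx = perm[:n_train]
val_idx = perm[n_train:n_train + n_val]
test_idx = perm[n_train + n_val:]

train_mask = torch.zeros(n_nodes, dtype=torch.bool)
val_mask = torch.zeros(n_nodes, dtype=torch.bool)
test_mask = torch.zeros(n_nodes, dtype=torch.bool)
train_mask[train_idx] = True
val_mask[val_idx] = True
test_mask[test_idx] = True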
> Yes, this is indeed an intriguing phenomenon. But another point: if you modify val_dataloader as dgl.dataloading.DataLoader(g, train_nid, sampler, batch_size=args.batch_size, shuffle=False, drop_last=False, num_workers=args.num_workers) (i.e., run the "validation" pass over the train nodes), the results remain the same: the reported val acc is still about 10% higher than the train acc. This makes me wonder if there's some functional error, and that is why I don't agree with your point about edge density.
Hey, I was able to reproduce your results. Yeah, I'm leaning towards it potentially being a functional error. I will need to debug it. Please let me know if you find anything that stands out, since I don't see anything obvious.
> Also, what do you mean by "reshuffling the dataset (assigning random splits) will cause the results to be more uniform"? What specific operations did you do?
That was a hypothesis, as I assumed it could be an edge-density-related issue. It might not be relevant. Maybe you can experiment by using a standard DGL train function and just using the IGB dataloader.
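A rough sketch of that suggestion, assuming g is the IGB graph already loaded as a DGL graph via the repo's dataloader, with 'feat', 'label', and the split masks in g.ndata; the sampler fanouts and batch size here are placeholders, not the repo's settings.

import dgl
import torch

# Sketch: build a plain DGL sampling pipeline on top of the IGB graph,
# then reuse any reference DGL node-classification train loop on the
# (input_nodes, seeds, blocks) batches it produces.
train_nid = torch.nonzero(g.ndata['train_mask'], as_tuple=True)[0]
sampler = dgl.dataloading.NeighborSampler([10, 15])
train_dataloader = dgl.dataloading.DataLoader(
    g, train_nid, sampler,
    batch_size=1024, shuffle=True, drop_last=False, num_workers=4)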
So, may I ask how you produced the results in the previous chart? Or did that code have a bug?
That was due to an issue with my local dataloader.
OK, I see.
> Yeah, I'm leaning towards it potentially being a functional error. I will need to debug it.
Agreed.
BTW, we were informed that there is some issue with the DGL sampler. Please check out this pull request: Creating a DGL sampler that matches PyG/GLT sampling result (#36).
It is possible that this issue is related.