Weak baseline
TL;DR: With the right hyperparameters, a single-task model reaches the same quality as the multi-task model, contradicting the claims in the Graphium docs. Multi-tasking doesn't seem to help.
Hi there! One of the biggest mysteries in my life is whether transfer/multi-task learning on different modalities has ever improved prediction quality. Graphium claims that a simple GCN gets 0.773 ROC AUC on Tox21, but in a multi-task regime with QM and ZINC predictions, it reaches 0.850 ROC AUC. That's interesting, so I tested it.
I took a standard GCN, turned off all heads except Tox21, and minimally tweaked the parameters. I found a model with 150k parameters (just as in the docs) that reaches >0.850 ROC AUC and >0.49 PR AUC predicting Tox21 only. I don't think that came from a clever combination of hyperparameters, but simply from letting it train longer: it took 1200 epochs instead of the 300 used in the evaluation.
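For reference, this is roughly the shape of the setup. It's a minimal sketch rather than my actual Graphium config: it uses PyTorch Geometric's MoleculeNet loader and illustrative hyperparameters (widths, learning rate, batch size), and a masked BCE loss because Tox21 labels are sparse.

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import MoleculeNet
from torch_geometric.loader import DataLoader
from torch_geometric.nn import GCNConv, global_mean_pool

dataset = MoleculeNet(root="data", name="Tox21")      # 12 sparse binary labels
loader = DataLoader(dataset, batch_size=256, shuffle=True)

class Tox21GCN(torch.nn.Module):
    def __init__(self, in_dim, hidden=96, n_tasks=12):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, n_tasks)   # the only head: Tox21

    def forward(self, x, edge_index, batch):
        h = F.relu(self.conv1(x.float(), edge_index))
        h = F.relu(self.conv2(h, edge_index))
        return self.head(global_mean_pool(h, batch))

model = Tox21GCN(dataset.num_node_features)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(1200):                              # the long schedule is what mattered
    for data in loader:
        logits = model(data.x, data.edge_index, data.batch)
        mask = ~torch.isnan(data.y)                    # skip missing Tox21 labels
        loss = F.binary_cross_entropy_with_logits(logits[mask], data.y[mask])
        opt.zero_grad(); loss.backward(); opt.step()
```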
One might argue that multi-task is better within a fixed number of epochs, but I don't think that matters. If we care about the budget, we should compare total compute (in FLOPs) or wall-clock time, and both are much lower for the single-task model. Besides, I don't think we really care about the budget for these models, since compute is much cheaper than a toxicity failure.
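As a rough back-of-the-envelope check (the dataset sizes below are approximate, not the exact ToyMix counts):

```python
# Approximate sizes: ~7.8k molecules in Tox21 alone vs ~150k in the full
# QM9 + Tox21 + ZINC12k ToyMix; the model size per forward pass is the same.
single_task_passes = 1200 * 7_800      # epochs * molecules per epoch  ~ 9.4M
multi_task_passes = 300 * 150_000      # epochs * molecules per epoch  ~ 45M
print(single_task_passes / multi_task_passes)  # ~ 0.21, i.e. ~5x fewer graph passes
```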
I would recommend using early_stopping so that training isn't cut off prematurely.
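Something along these lines, as a generic PyTorch Lightning sketch rather than Graphium's actual config (the monitored metric name is a placeholder):

```python
# Set a generous max_epochs and let early stopping decide when to stop.
# "val/tox21_auroc" is a placeholder metric name, not necessarily what Graphium logs.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val/tox21_auroc", mode="max", patience=50)
trainer = Trainer(max_epochs=2000, callbacks=[early_stop])
# trainer.fit(model, datamodule=datamodule)  # model/datamodule as defined elsewhere
```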
Did you also tune the hyper-parameters of the multi-task model for a fair comparison?
In our experiment, as detailed in the paper, we only very slightly tuned the parameters for the multi-task model and then removed the other heads. This way, both settings sit in exactly the same hyperparameter regime. Otherwise, you can spend endless time tuning either the single-task or the multi-task model, and easily cheat by tuning the one you favour more thoroughly than the one you don't.
Note that, for tiny models (150k parameters), it is difficult to get any benefit from multi-tasking, as the model will likely underfit. The goal of graphium is not to train on tiny datasets. It is to train on thousands of labels/tasks at the same time with billion-parameter models. In that regime, we clearly observe that multi-task (and multi-label) learning is strongly beneficial. The dataset you tried is called "ToyMix" for a reason.
Yes, I tuned the multi-task model. You're right that it's unfair to compare models if they've been tuned with different levels of diligence, but I actually put more effort into tuning the multi-task model :). I managed to achieve a ~0.86 ROC AUC for the multi-task model, which is slightly higher than the 0.85 for the single-task model. However, I believe that's only because I stopped tuning the single-task model as soon as it hit 0.85. I shared the experiments here: (wandb for single-task) (wandb for multi-task)
You mentioned that you compared single-task vs. multi-task models without changing hyperparameters, but, of course, different architectures often require different parameters to perform optimally. In this case, the single-task GCN likely needed more epochs.
You're also correct that I only tested on a small dataset, so I can't be certain whether multi-tasking would offer an advantage on larger datasets. I've re-read your excellent paper, "On the Scalability of GNNs for Molecular Graphs," multiple times. It's a very detailed and rigorous investigation, and certainly a highlight of 2024's GNN papers—thanks for that! However, while the paper shows that scaling helps, I think it's still unclear whether multi-tasking is beneficial. I even sent two emails to Dr. Frederik Wenkel asking about the ablation studies, but he hasn't responded.
Would you expect that without pre-training on LargeMix, I wouldn't achieve such good metrics if I trained the same large model directly on solubility (or a similar benchmark)?
certainly a highlight of 2024's GNN papers
Glad you enjoyed the paper!! Hopefully the reviewers feel the same way ;)
To be honest, I never ever trained a model for 1200 epochs XD. I always cap it at 200-400.
Regarding the multi-task setting, in its current form the paper doesn't quite show its benefits. Removing L1000 improved results, and removing PCQM4M does not change the results much. However, if you look at the "label fraction" column of Figure 2, you can see that more labels are beneficial (see figure below). This is mostly due to the labels in PCBA_1328. The number of labels plays a similar role to the number of tasks, with the difference that it's a single MLP with a large output dimension instead of multiple task heads.
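To make that distinction concrete, here is a rough sketch with illustrative shapes (not our actual modules or output dimensions):

```python
# "Many labels" = one MLP whose output dimension is the number of labels;
# "many tasks"  = a separate small head per task. Output dims are illustrative.
import torch
from torch import nn

hidden, n_labels = 256, 1328

# multi-label: one shared head with a large output dimension (PCBA_1328-style)
multi_label_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_labels))

# multi-task: one head per task, each with its own (usually small) output
multi_task_heads = nn.ModuleDict({
    "tox21":   nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 12)),
    "qm9":     nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 19)),
    "zinc12k": nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 3)),
})

g = torch.randn(8, hidden)                    # graph-level embeddings
labels = multi_label_head(g)                  # [8, 1328]
tasks = {name: head(g) for name, head in multi_task_heads.items()}
```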
To give you an example, the PCBA_128 dataset from OGB saturates at 0.32 average precision, and any large model strongly overfits there. However, multiplying the number of parameters by 11 allows us to continuously scale our performance on PCBA_1328 up to 0.41 without L1000 (admittedly not the same split), and it still scales beyond 1B parameters. It was the main driver of fine-tuning performance. This is the evidence you need for multi-tasking.
What's more, we'll soon update the paper with a new dataset: learning to predict cell Phenomics embeddings. Not only does it significantly improve downstream performance, it also improves the pretraining performance on PCBA_1328. Stay tuned.
I did enjoy the paper, and I'm confident the reviewers will feel the same -- it's definitely an A* paper. Best of luck with it!
I agree that increasing the label fraction might suggest improvements from multi-task learning. It’s clear how this could boost predictions for tasks like hERG/CYP and certain toxicity metrics, but what's really fascinating is that multi-tasking also enhances performance on tasks as different as solubility and permeability. The one thing I'm still uncertain about is how challenging it would be to achieve the same metrics without any multi-task pretraining, and which specific modalities contribute most to the observed improvements. I might need to try training those modalities myself to get a clearer picture.
Regarding the improvement on PCBA, I’m not sure we can make a direct comparison. First, OGB uses a scaffold split, which is considered more challenging than the random split in LargeMix. Second, PR AUC is highly dependent on class balance, which differs between the datasets (1.4% positives in OGB PCBA 128 vs. 3.5% in PCBA 1328). Lastly, there’s simply more data! Even if the number of datapoints for DRD2 agonists is the same, adding DRD3 agonists naturally helps, as these targets share many binders.
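To illustrate the class-balance point with synthetic data (just to show the baseline shift, not the real datasets):

```python
# With random scores, average precision roughly equals the positive rate,
# so 1.4% vs 3.5% positives already implies different baselines before any modelling.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
for pos_rate in (0.014, 0.035):
    y_true = rng.random(200_000) < pos_rate
    y_score = rng.random(200_000)              # uninformative scores
    print(pos_rate, round(average_precision_score(y_true, y_score), 3))
# prints values close to 0.014 and 0.035
```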
p.s. Looking forward to phenomics experiments!
Yeah, there are so many things one could explore! Ideally, we would try to embed the task description and see whether it helps multi-tasking across similar vs. dissimilar tasks.
Sure, the PR AUC is not directly comparable between 128 and 1328, but there is no more data on those 128 assays; the additional data is on other assays. One thing is clear: large models overfit on PCBA_128 but not on PCBA_1328, which indicates that more labels help. For a fair comparison, though, we would need to use the same split and report our metrics only on the same 128 labels.
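Something like the following, as a hypothetical sketch (shared_cols standing in for whichever columns of PCBA_1328 map onto the original 128 OGB assays):

```python
# Restrict the metric to the shared assays; labels are assumed to be 0/1 with NaN
# for missing entries. shared_cols is a placeholder index list, not a real mapping.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_ap_on_shared(y_true, y_pred, shared_cols):
    aps = []
    for c in shared_cols:
        mask = ~np.isnan(y_true[:, c])                 # assays are sparsely labelled
        if mask.any() and 0 < y_true[mask, c].sum() < mask.sum():
            aps.append(average_precision_score(y_true[mask, c], y_pred[mask, c]))
    return float(np.mean(aps))
```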
@SteshinSS , the new version of the MolGPS paper includes the Phenomics experiments in Figure 5.
I will be closing the issue based on what we concluded in the discussion. Feel free to reopen it if you think it should be discussed further.