
About data leakage on zero-shot classification?

Open LTEnjoy opened this issue 2 years ago • 2 comments

Hello!

Thanks for your great work! I tested zero-shot classification with your released checkpoint and it performed well. However, I am wondering whether there is a data leakage problem. Your model was fine-tuned on the Swiss-Prot database, and the DeepLoc dataset was also constructed from the UniProt database. Did you do any filtering when you tested zero-shot performance?

Looking forward to your reply! Thanks in advance!

LTEnjoy avatar Nov 02 '23 04:11 LTEnjoy

Hi, thank you for your interest in our work!

Please see the pre-training dataset: https://github.com/DeepGraphLearning/ProtST/blob/db53a76ed2430eb66dd9c8134ace99fd60980fb3/protst/dataset.py#L22. It does not expose the labeled test data of any benchmark dataset; those test labels are not observed during multimodal pre-training or downstream fine-tuning.

KatarinaYuan avatar Nov 03 '23 12:11 KatarinaYuan

Hi,

Thank you for the reply and I'll check it out!

LTEnjoy avatar Nov 04 '23 04:11 LTEnjoy
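
For readers who want to check sequence overlap themselves, the sketch below shows one simple way to do it. It is not taken from the ProtST repository: the function names `find_exact_overlap` and `drop_overlapping` are hypothetical, and it assumes both datasets are already loaded as plain lists of amino-acid strings. It only detects exact sequence matches, so near-identical homologs would need a sequence-identity-based comparison instead.

```python
from typing import Iterable, List


def _normalize(seq: str) -> str:
    # Hypothetical helper: ignore surrounding whitespace and letter case.
    return seq.strip().upper()


def find_exact_overlap(pretrain_seqs: Iterable[str], test_seqs: Iterable[str]) -> List[str]:
    """Return the test sequences that appear verbatim in the pre-training set."""
    pretrain_set = {_normalize(s) for s in pretrain_seqs}
    return [s for s in test_seqs if _normalize(s) in pretrain_set]


def drop_overlapping(pretrain_seqs: Iterable[str], test_seqs: Iterable[str]) -> List[str]:
    """Return the test sequences that do NOT appear in the pre-training set."""
    pretrain_set = {_normalize(s) for s in pretrain_seqs}
    return [s for s in test_seqs if _normalize(s) not in pretrain_set]


if __name__ == "__main__":
    # Toy sequences for illustration only, not real dataset entries.
    pretrain = ["MKTAYIAK", "GAVLIPFW"]
    test = ["mktayiak", "QQQQ"]
    print(find_exact_overlap(pretrain, test))
    print(drop_overlapping(pretrain, test))
```

Note that sequence overlap alone is a stricter criterion than the one discussed above, which concerns whether test labels are exposed during training.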