
Inconsistency in number of classes in EC/GO downstream datasets

Open klemens-floege opened this issue 1 year ago • 1 comments

Dear all,

I believe I have found some major flaws in the EC/GO downstream datasets you linked on your Google Drive (https://drive.google.com/drive/folders/11dNGqPYfLE3M-Mbh4U7IQpuHxJpuRr4g).

In the SaProt codebase, the `SaProtAnnotationModel` class specifies the number of classes in these datasets as `label2num = {"EC": 585, "GO_BP": 1943, "GO_MF": 489, "GO_CC": 320}`. However, when I inspect the EC dataset, for example, I find only 366 distinct classes in the training set, 263 in the test set, and 287 in the validation set. Similar discrepancies appear in all three GO datasets. This looks like an ill-posed classification problem to me, and I would appreciate some clarification.

Thank you very much for taking the time to look into this.

PS: Here is the simple Pandas code I used for the analysis.

```python
import pandas as pd

# ec_test_path, ec_train_path, ec_valid_path: paths to the EC CSV splits
df_test = pd.read_csv(ec_test_path)
df_train = pd.read_csv(ec_train_path)
df_valid = pd.read_csv(ec_valid_path)

# Number of distinct values in the 'class' column of each split
df_train['class'].nunique()  # 366
df_test['class'].nunique()   # 263
df_valid['class'].nunique()  # 287

# Convert 'class' columns to sets
train_classes = set(df_train['class'])
valid_classes = set(df_valid['class'])
test_classes = set(df_test['class'])

# Find the pairwise intersections of the sets
intersection_train_val = train_classes.intersection(valid_classes)
intersection_train_test = train_classes.intersection(test_classes)
intersection_val_test = valid_classes.intersection(test_classes)

len(intersection_train_val)   # 287
len(intersection_train_test)  # 262
len(intersection_val_test)    # 207
```

klemens-floege, Jul 31 '24 10:07

Hi, thank you for your interest in our work!

Could you explain in more detail how you define a "distinct class"? The EC and GO tasks are multi-label tasks, each made up of many binary classifications: a protein is mapped to multiple labels, one per function, and each label is 0 or 1 depending on whether the protein has that function. For instance, the number 585 for the EC task means that each protein has 585 binary labels, such as 0 1 0 ... 1 0 0. A 1 at a given position indicates that the protein has the corresponding function.
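To make the encoding concrete, here is a minimal sketch in plain Python. The helper names `to_multi_hot` and `count_used_labels`, the protein names, and the label indices are all made up for illustration; only the vector length 585 comes from `label2num["EC"]`. This is not the SaProt data-loading code, just an illustration of the idea.

```python
NUM_EC_LABELS = 585  # vector length taken from label2num["EC"]


def to_multi_hot(active_indices, num_labels=NUM_EC_LABELS):
    """Return a 0/1 list with a 1 at every index in active_indices."""
    vec = [0] * num_labels
    for i in active_indices:
        vec[i] = 1
    return vec


def count_used_labels(vectors):
    """Count label positions that are 1 in at least one vector."""
    return sum(1 for column in zip(*vectors) if any(column))


# Hypothetical proteins with made-up label indices (not real EC data)
toy_labels = {
    "protein_a": to_multi_hot([1, 580]),
    "protein_b": to_multi_hot([1, 42, 300]),
}

used = count_used_labels(toy_labels.values())

# Each vector has 585 entries, protein_a has 2 functions,
# and 4 label positions are used across the toy set.
print(len(toy_labels["protein_a"]), sum(toy_labels["protein_a"]), used)  # 585 2 4
```

In this toy set every protein still carries a 585-dimensional label vector, even though only 4 positions are ever set to 1. Whether a similar effect explains the counts reported above depends on how the `class` column is stored in the CSV files, which is not shown in this thread.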

LTEnjoy, Jul 31 '24 14:07