IEConv_proteins
The Enzyme dataset contains a mix between train and test splits.
I have looked at the dataset in more detail and found many examples of identical protein sequences. The paper states that the sequence similarity is less than 100%; I am assuming this means less than or equal to 100%.
More seriously, some identical sequences appear in both the train and test splits, and some in both the train and validation splits.
Protein chains that are the same across splits ({id in pdb}|{original id}|{split}):
['5lf0_W|5lf0_W|train', '5m32_I|5m32_I|train', '5le5_I|5le5_I|train', '5lf1_I|5lf1_I|train', '5lf3_I|5lf3_I|train', '5gjq_q|5gjq_q|valid']
5lf0: Human 20S proteasome complex with Epoxomicin at 2.4 Angstrom
5m32: Human 26S proteasome in complex with Oprozomib
5le5: Native human 20S proteasome at 1.8 Angstrom
5lf1: Human 20S proteasome complex with Dihydroeponemycin at 2.0 Angstrom
5lf3: Human 20S proteasome complex with Bortezomib at 2.1 Angstrom
5gjq: Structure of the human 26S proteasome bound to USP14-UbAl
All chains used from these complexes have the same sequence, which maps to: Proteasome subunit beta type-3, UniProtKB accession: P49720
['3von_E|3von_E|train', '3von_b|3von_b|test', '3von_p|3von_p|test', '3von_i|3von_i|test']
3von: Crystalstructure of the ubiquitin protease
All chains used from this complex have the same sequence, which maps to: Ubiquitin-conjugating enzyme E2 N, UniProtKB accession: P61088
['3mg8_I|3mg8_I|train', '4qlq_W|4qlq_W|train', '6huv_I|6huv_I|train', '5fga_W|5fga_W|train', '4qby_W|4qby_W|train', '5mpa_j|5mpa_j|test', '5mp9_j|5mp9_j|test']
3mg8: Structure of yeast 20S open-gate proteasome with Compound 16
4qlq: yCP in complex with tripeptidic epoxyketone inhibitor 8
6huv: Yeast 20S proteasome with human beta2c (S171G) in complex with 39
5fga: Yeast 20S proteasome beta5-K33A mutant (propeptide expressed in trans)
4qby: yCP in complex with BOC-ALA-ALA-ALA-CHO
5mpa: 26S proteasome in presence of ATP (s2)
5mp9: 26S proteasome in presence of ATP (s1)
All chains used from these complexes have the same sequence, which maps to: Proteasome subunit beta type-3, UniProtKB accession: P25451
['4y84_X|4y84_X|train', '5l5e_X|5l5e_X|train', '6huu_J|6huu_J|train', '4qby_J|4qby_J|train', '4ya9_J|4ya9_J|train', '5mp9_k|5mp9_k|test', '5mpa_k|5mpa_k|test']
4y84: Yeast 20S proteasome in complex with N3-A(4,4-F2P)nLL-ep
5l5e: Yeast 20S proteasome with human beta5i (1-138) and human beta6 (97-111; 118-133) in complex with carfilzomib
6huu: Yeast 20S proteasome with human beta2c (S171G) in complex with 29
4qby: yCP in complex with BOC-ALA-ALA-ALA-CHO
4ya9: Yeast 20S proteasome beta2-H114D mutant in complex with Ac-LAD-ep
5mp9: 26S proteasome in presence of ATP (s1)
5mpa: 26S proteasome in presence of ATP (s2)
All chains used from these complexes have the same sequence, which maps to: Proteasome subunit beta type-4, UniProtKB accession: P22141
Train and test mix:
[('train', 190, '4y84_X'), ('train', 190, '5l5e_X'), ('train', 190, '6huu_J'), ('train', 190, '4qby_J'), ('train', 190, '4ya9_J'), ('test', 190, '5mp9_k'), ('test', 190, '5mpa_k')]
[('train', 155, '3von_E'), ('test', 155, '3von_b'), ('test', 155, '3von_p'), ('test', 155, '3von_i')]
[('train', 190, '6hed_4'), ('train', 190, '6hec_5'), ('train', 190, '6he8_4'), ('train', 190, '6he9_3'), ('train', 190, '6he7_6'), ('test', 190, '6he8_k'), ('test', 190, '6hed_h'), ('test', 190, '6hea_i'), ('test', 190, '6hea_h'), ('test', 190, '6he9_i')]
[('train', 190, '3mg8_I'), ('train', 190, '4qlq_W'), ('train', 190, '6huv_I'), ('train', 190, '5fga_W'), ('train', 190, '4qby_W'), ('test', 190, '5mpa_j'), ('test', 190, '5mp9_j')]
[('train', 190, '5lf1_b'), ('train', 190, '5lf1_B'), ('test', 190, '5gjq_j')]
[('train', 190, '1iru_R'), ('test', 190, '5gjq_k')]
Train and validation mix:
[('train', 190, '5lf0_W'), ('train', 190, '5m32_I'), ('train', 190, '5le5_I'), ('train', 190, '5lf1_I'), ('train', 190, '5lf3_I'), ('valid', 190, '5gjq_q')]
Test and validation mix:
[]
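A check of this kind can be sketched as follows. This is a minimal illustration, not the script that produced the lists above: the function name `find_split_mixes`, the record layout `(split, label, chain_id, sequence)`, and the toy sequences are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def find_split_mixes(records):
    """Group chains by exact sequence and report groups that span two splits.

    `records` is an iterable of (split, label, chain_id, sequence) tuples.
    This layout is hypothetical; adapt it to however the dataset is loaded.
    """
    groups = defaultdict(list)
    for split, label, chain_id, sequence in records:
        groups[sequence].append((split, label, chain_id))

    mixes = defaultdict(list)
    for entries in groups.values():
        splits = {split for split, _, _ in entries}
        # Every pair of distinct splits sharing this sequence is a mix.
        for pair in combinations(sorted(splits), 2):
            mixes[pair].append(entries)
    return dict(mixes)


# Toy data (made-up chain ids and sequences, for illustration only).
toy_records = [
    ("train", 0, "aaaa_A", "MKTAYIAK"),
    ("test", 0, "bbbb_B", "MKTAYIAK"),
    ("train", 1, "cccc_C", "GSHMLEDP"),
    ("valid", 1, "dddd_D", "QQQQ"),
]
mixes = find_split_mixes(toy_records)
for pair, groups_for_pair in mixes.items():
    print(pair, groups_for_pair)
```

Only exact string matches are detected here; near-identical sequences would need an alignment-based or clustering-based comparison instead.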
PDB ids to be removed because of the mix:
['4y84_X', '5l5e_X', '6huu_J', '4qby_J', '4ya9_J', '5mp9_k', '5mpa_k', '3von_E', '3von_b', '3von_p', '3von_i', '6hed_4', '6hec_5', '6he8_4', '6he9_3', '6he7_6', '6he8_k', '6hed_h', '6hea_i', '6hea_h', '6he9_i', '3mg8_I', '4qlq_W', '6huv_I', '5fga_W', '4qby_W', '5mpa_j', '5mp9_j', '5lf1_b', '5lf1_B', '5gjq_j', '1iru_R', '5gjq_k', '5lf0_W', '5m32_I', '5le5_I', '5lf1_I', '5lf3_I', '5gjq_q']
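Applying such a removal list to a split could look like the sketch below. The helper `drop_chains` and the example split contents are hypothetical; only the three blocked ids are taken from the list above.

```python
def drop_chains(chain_ids, to_remove):
    """Return the chain ids that are not in the removal list.

    The comparison is case-sensitive on purpose: '5lf1_b' and '5lf1_B'
    are different chains of the same PDB entry.
    """
    blocked = set(to_remove)
    return [chain_id for chain_id in chain_ids if chain_id not in blocked]


# A few ids from the removal list above.
to_remove = ["5lf1_b", "5lf1_B", "3von_E"]

# Hypothetical split contents; '1abc_A' and '5lf1_c' are made-up ids.
train_ids = ["1abc_A", "5lf1_b", "5lf1_B", "5lf1_c", "3von_E"]
clean_train_ids = drop_chains(train_ids, to_remove)
print(clean_train_ids)
```

Dropping every copy is the most conservative choice. Keeping the copies in a single split would also remove the overlap while discarding fewer chains.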
Length refers to the number of entries (chains) pointing to the same protein sequence; the number after it is how many unique sequences have that many entries.
Total number of chains: 37428
Total number of unique chains: 15640
length 1: 5845
length 2: 4308
length 3: 1307
length 4: 1895
length 5: 2264
length 6: 8
length 7: 8
length 8: 4
length 9: 0
length 10: 1
length 11: 0
length 12: 0
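The counts above are internally consistent, which the short sketch below verifies: the per-length counts sum to the number of unique chains, and the counts weighted by length sum to the total number of chains. The function `group_size_histogram` is a hypothetical helper showing how such a histogram could be computed from a list of sequences; the `reported` numbers are copied from this issue.

```python
from collections import Counter


def group_size_histogram(sequences):
    """Map group size -> number of unique sequences with that many copies."""
    copies_per_sequence = Counter(sequences)
    return Counter(copies_per_sequence.values())


# Histogram as reported in this issue (length -> number of unique sequences).
reported = {1: 5845, 2: 4308, 3: 1307, 4: 1895, 5: 2264, 6: 8,
            7: 8, 8: 4, 9: 0, 10: 1, 11: 0, 12: 0}

unique_total = sum(reported.values())
chain_total = sum(size * count for size, count in reported.items())
print(unique_total, chain_total)  # 15640 37428
```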
Number of identical sequences pointing to different EC numbers: 1
[('train', 201, '6giq_e'), ('train', 152, '6giq_E'), ('train', 152, '6giq_P')]
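A label-consistency check like this one can be sketched as below. It assumes the second tuple element is the class label and that records carry the sequence as a fourth field; both are assumptions for the example, and `find_label_conflicts` and the toy data are hypothetical.

```python
from collections import defaultdict


def find_label_conflicts(records):
    """Return groups of identical sequences that carry more than one label.

    `records` is an iterable of (split, label, chain_id, sequence) tuples
    (a hypothetical layout).
    """
    entries = defaultdict(list)
    labels = defaultdict(set)
    for split, label, chain_id, sequence in records:
        entries[sequence].append((split, label, chain_id))
        labels[sequence].add(label)
    return [entries[seq] for seq in entries if len(labels[seq]) > 1]


# Toy data (made-up ids and sequences, for illustration only).
toy_labeled = [
    ("train", 7, "aaaa_a", "MKTAYIAK"),
    ("train", 3, "aaaa_A", "MKTAYIAK"),
    ("train", 3, "bbbb_B", "GSHMLEDP"),
    ("test", 3, "cccc_C", "GSHMLEDP"),
]
conflicts = find_label_conflicts(toy_labeled)
print(len(conflicts), conflicts)
```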