
SubcellularLocalization Dataset error

Open chrismun opened this issue 2 years ago • 1 comments

".../protein.py", line 301, in from_sequence
    raise ValueError("Invalid sequence %s" % sequence)
ValueError: Invalid sequence MALAVRVVYCGAUGYKPKYLQLKEKLEHEFPGCLDICGEGTPQVTGFFEVTVAGKLVHSKKRGDGYVDTESKFRKLVTAIKAALAQCQ

Loading the SubcellularLocalization dataset fails with the error above. I get a similar error with the BinaryLocalization dataset; other datasets, such as Stability, load without problems.

chrismun, Jun 15 '23 13:06

I think the problem is caused by invalid amino acid types inside the protein sequences. They are not handled when the Protein data object is created. I added a skip block for the unknown amino acids inside the protein sequences.

beyzoskaya, Sep 14 '25 13:09
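
For anyone hitting the same error, the skip idea from the comment above could look roughly like the sketch below. It is plain Python and does not touch torchdrug itself. It assumes the accepted alphabet is the 20 standard one-letter residue codes, which may not match the library's actual table, and the helper names `invalid_residues` and `split_valid` are hypothetical, not part of torchdrug.

```python
# Assumption: the accepted alphabet is the 20 standard one-letter residue
# codes. The library's real residue table may differ.
STANDARD_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")


def invalid_residues(sequence):
    """Return the sorted list of characters that are not standard residues."""
    return sorted(set(sequence) - STANDARD_RESIDUES)


def split_valid(sequences):
    """Partition sequences into (kept, skipped) lists.

    A sequence is skipped if it contains at least one non-standard residue.
    """
    kept, skipped = [], []
    for seq in sequences:
        if invalid_residues(seq):
            skipped.append(seq)
        else:
            kept.append(seq)
    return kept, skipped


# The sequence from the traceback in this issue.
reported = (
    "MALAVRVVYCGAUGYKPKYLQLKEKLEHEFPGCLDICGEGTPQVTGFFEVTVAGKLVHSKKRGDGYVDTESKF"
    "RKLVTAIKAALAQCQ"
)

print(invalid_residues(reported))  # ['U']
```

Under that assumption the only offending character in the reported sequence is `U` (selenocysteine). Skipping such sequences changes the dataset size, so it is worth logging how many were dropped before comparing results against published numbers.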