torchdrug
SubcellularLocalization Dataset error
".../protein.py", line 301, in from_sequence
raise ValueError("Invalid sequence %s" % sequence)
ValueError: Invalid sequence MALAVRVVYCGAUGYKPKYLQLKEKLEHEFPGCLDICGEGTPQVTGFFEVTVAGKLVHSKKRGDGYVDTESKFRKLVTAIKAALAQCQ
Loading the SubcellularLocalization dataset fails with the error above. I get a similar error with the BinaryLocalization dataset, but other datasets, such as Stability, load fine.
I think the problem is that some protein sequences contain invalid amino acid types, and these are not handled when the Protein data object is created. I added a block that skips the unknown amino acids inside the protein sequences.
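For anyone hitting the same error, here is a minimal sketch of the "skip unknown residues" idea in plain Python, applied before the sequence is handed to the library. It assumes the valid alphabet is the 20 standard one-letter amino acid codes; the helper names `find_unknown_residues` and `drop_unknown_residues` are hypothetical and not part of torchdrug, and this is not necessarily how the skip block mentioned above is written.

```python
# Assumption: the valid alphabet is the 20 standard amino acid one-letter codes.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def find_unknown_residues(sequence):
    """Return (index, letter) pairs for residues outside the standard alphabet."""
    return [(i, aa) for i, aa in enumerate(sequence)
            if aa not in STANDARD_AMINO_ACIDS]


def drop_unknown_residues(sequence):
    """Return the sequence with every non-standard residue removed."""
    return "".join(aa for aa in sequence if aa in STANDARD_AMINO_ACIDS)


# The sequence reported in the traceback above.
sequence = ("MALAVRVVYCGAUGYKPKYLQLKEKLEHEFPGCLDICGEGTPQVTGFFEVTVAGKLVHSKKRGDGYVD"
            "TESKFRKLVTAIKAALAQCQ")

unknown = find_unknown_residues(sequence)
cleaned = drop_unknown_residues(sequence)
print(unknown)
print(len(sequence), len(cleaned))
```

In this sequence the only flagged residue is `U` (selenocysteine), which is not one of the 20 standard codes. Note that dropping residues shifts the positions of everything after them, so replacing the unknown residue with a placeholder, or skipping the whole sample, may be a better choice if positions matter for your task.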