openprotein
Keys in hdf5
Hi, nice work here. I wanted to ask: after preprocessing the raw data to an hdf5 file, the file has primary, mask and tertiary keys. Does this mean model training only looks at the amino acid sequence? According to AlQuraishi's paper, shouldn't the input be amino acid sequence + PSSM?
Hey @maverick0004! Correct, currently this only uses the amino acid sequence. However, since the PSSM data is in the ProteinNet data set, it should be quick to include it in the hdf5/model :) The relevant code for parsing the ProteinNet format is here: https://github.com/OpenProtein/openprotein/blob/master/preprocessing.py#L53
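For illustration, here is a rough sketch of how the PSSM could be combined with the sequence into per-residue input features. It is not code from this repo: `AMINO_ACIDS`, `one_hot` and `build_features` are hypothetical names, the alphabet order is an assumption, and the 21 x L layout (20 PSSM rows plus one information-content row) reflects my understanding of the ProteinNet evolutionary section, so please check it against the actual files.

```python
# Hypothetical sketch, not part of openprotein.
# Assumed 20-letter amino acid alphabet; the order used by the real
# preprocessing code may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Return a 20-dimensional one-hot vector for a single residue."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def build_features(primary, pssm_rows):
    """Concatenate one-hot sequence features with PSSM columns.

    primary:   amino acid string of length L
    pssm_rows: list of rows, each of length L (assumed 21 x L,
               i.e. 20 PSSM rows + 1 information-content row)
    Returns a list of L feature vectors, one per residue.
    """
    length = len(primary)
    for row in pssm_rows:
        if len(row) != length:
            raise ValueError("every PSSM row must have one value per residue")

    features = []
    for i, residue in enumerate(primary):
        # Column i of the PSSM holds the profile for residue i.
        column = [row[i] for row in pssm_rows]
        features.append(one_hot(residue) + column)
    return features


# Toy usage: a 2-residue protein with a constant dummy profile.
features = build_features("AC", [[0.1, 0.2] for _ in range(21)])
```

With these assumptions each residue gets a 41-dimensional vector (20 one-hot values + 21 profile values) instead of the 20-dimensional sequence-only encoding. The PSSM would also need to be written to the hdf5 file as an extra key next to primary, mask and tertiary.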
@JeppeHallgren So if only the amino acid sequence is taken as input, aren't the predictions less accurate than those using sequence + PSSM, as done by AlQuraishi?