
[Suggestion] Change the Wiki's recommendation for multi-domain proteins

Open apcamargo opened this issue 3 years ago • 2 comments

In the Wiki it is stated:

For long sequences, it may therefore be of advantage to first search the PDB or the SCOP domain database and then to cut the query sequence into smaller parts on the basis of the identified structural domains. Pfam or CDD are - in our opinion - less suitable to determine domain boundaries.

I'm not sure the PDB is a good choice for multi-domain proteins, though: it contains some unprocessed polyproteins, which will usually get lower E-values than each individual domain (e.g., https://www.rcsb.org/structure/2IJD).

Also, is there any specific reason for Pfam to be less suitable for boundaries? I've been using it together with SCOP and got good results.
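
For reference, the cutting step the Wiki describes could be sketched roughly as below. This is only an illustration, not part of HH-suite: the function name `cut_query`, the domain labels, the example sequence, and the boundaries are all made up, and boundaries are assumed to be 1-based and inclusive, taken by hand from the query-side alignment ranges of the hits.

```python
def cut_query(sequence, domains):
    """Cut a query sequence into one piece per domain.

    `domains` is an iterable of (label, start, end) tuples with
    1-based, inclusive coordinates on the query (an assumption of
    this sketch; adapt to however you record the boundaries).
    Returns a list of (label, subsequence) pairs ordered by start.
    """
    parts = []
    for label, start, end in sorted(domains, key=lambda d: d[1]):
        if not (1 <= start <= end <= len(sequence)):
            raise ValueError(f"bad boundaries for {label}: {start}-{end}")
        # Convert 1-based inclusive coordinates to a Python slice.
        parts.append((label, sequence[start - 1:end]))
    return parts


# Hypothetical query and boundaries, for illustration only.
query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
parts = cut_query(query, [("domB", 16, 33), ("domA", 1, 15)])

# Print each piece as a FASTA record that could be searched separately.
for label, subseq in parts:
    print(f">{label}\n{subseq}")
```

Each piece can then be used as its own query. If a hit turns out to be a polyprotein or precursor, its alignment range would span several domains, so the boundaries would need to be checked before cutting.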

apcamargo, May 15 '21 19:05

Thank you for the remark. Answer from @soeding:

"Many Pfam domain families were founded when no structure of any member was yet available. Oftentimes, the domain boundaries defined by sequence-based methods have been quite inaccurate, comprising fractions of a domain, a domain and a half, etc. Pfam has historically been very slow in updating its family definitions to harmonize with the domain boundaries elucidated by protein structure determination. Therefore, Pfam is less suited to determining the boundaries of structural/functional domains than CATH / SCOP / ECOD, which are based on the PDB."

martin-steinegger, May 17 '21 09:05

Thanks! I got it now!

I managed to get the download links for the HHpred databases, so I can use SCOPe and ECOD now. Regardless, my only concern is that the PDB contains some precursors, so we shouldn't just expect the matches to be single domains (unless precursors and polyproteins are removed from the database beforehand).

apcamargo, May 17 '21 22:05