New Protein Descriptor: Symmetric extractDC

Open discoleo opened this issue 2 months ago • 1 comments

The current extractDC is not symmetric: it treats "XY" and "YX" as distinct dipeptides and therefore generates 400 keys (20 × 20). This has some drawbacks:

  • Proteins shorter than 400 AA cannot contain all keys;
  • Proteins of 400 to 1000 AA: a significant number of keys will still have counts of 0 or 1.

It may be wise to implement a symmetric descriptor, where "XY" == "YX":

  • Statistical power is likely to increase, as most counts will increase;
  • I feel that there is no functional difference between "XY" and "YX" at the protein level.

The symmetric variant would have 210 keys instead of 400 (20 × 21 / 2): "AA", "AC", "AD", ..., with each pair "XY" written so that "X" comes no later than "Y" in alphabetical order. The proportions could be normalized by dividing by (2*n-2), where n = number of AA in the protein.
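To make the proposal concrete, here is a minimal sketch in Python rather than R. It is not part of protr, and the names `symmetric_dc`, `AMINO_ACIDS` and `SYMMETRIC_KEYS` are hypothetical. It assumes the 20 standard one-letter amino acid codes and collapses each adjacent pair to its alphabetically sorted form. The sketch tallies each pair once and divides by n-1, so the proportions sum to 1. Under one possible reading of the (2*n-2) denominator, where each pair is tallied twice (once per reading direction), the resulting values would be identical.

```python
from collections import Counter
from itertools import combinations_with_replacement

# The 20 standard amino acids, one-letter codes in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# 210 unordered pairs ("AA", "AC", ..., "YY"), first letter <= second letter.
SYMMETRIC_KEYS = ["".join(p) for p in combinations_with_replacement(AMINO_ACIDS, 2)]


def symmetric_dc(sequence):
    """Hypothetical symmetric dipeptide composition (illustration only)."""
    seq = sequence.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least 2 residues")
    unknown = set(seq) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")

    counts = Counter()
    for a, b in zip(seq, seq[1:]):
        # Sorting the two letters maps "XY" and "YX" to the same key.
        counts["".join(sorted(a + b))] += 1

    n_pairs = len(seq) - 1  # assumption: each adjacent pair is tallied once
    return {key: counts[key] / n_pairs for key in SYMMETRIC_KEYS}
```

For example, `symmetric_dc("ACCA")` sees the pairs "AC", "CC" and "CA", so the key "AC" receives two of the three counts and "CC" receives one.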

It would be interesting to compare this descriptor against the current extractDC on real-life protein data sets.

discoleo, Apr 22 '24 16:04