biozernike icon indicating copy to clipboard operation
biozernike copied to clipboard

Calculating moments between two PDBs

Open thorstenwagner opened this issue 1 year ago • 3 comments

Hi!

Thanks for the lib :-)

I would like to calculate a similarity score between two PDBs (like rcsb but for any protein combination) . Is there any documentation how to do that with this library?

Best, Thorsten

thorstenwagner avatar Oct 20 '23 10:10 thorstenwagner

Hi Thorsten,

The simplest way is to use InvariantNorm.compareInvariants, as in the example here: https://github.com/biocryst/biozernike/blob/bb21b6f59fce4ca82276d6d601df7ef38f74b849/src/test/java/org/rcsb/biozernike/DescriptorTest.java#L163C35-L163C35

It compares moments between a coordinate structure and an EM one, but comparing two PDB structures would be the same. You will get a distance score that will negatively correlate with the rcsb's similarity score. The correlation will not be perfect though, as that one is a weighted distance, with weights trained on a set of assemblies.

Sample weights can be found here https://github.com/biocryst/biozernike/blob/master/src/test/resources/descriptor.properties (these were trained on CATH subsets if I remember correctly). This file is used to initialise the Descriptor class, but I just realised that we're not really showing how to use the Descriptor class in this library beyond initialisation.

Basically using all moments (as in compareInvariants) is a decent starting point, but you can tune the distance measure by fitting the moment weights to maximise retrieval metrics on your dataset, if needed.

biocryst avatar Oct 23 '23 10:10 biocryst

Thank you for your response :-)

So, that means I could also use a molmap (mrc) to calculate the similarity? That would be even better, because I actually want to know how similar two molmaps (10A) are which I converted from PDB.

I think this code from your test would read the map?

InputStream is = DescriptorTest.class.getResourceAsStream("/emd_3186.map");
		Volume volumeEM = VolumeIO.read(is, MapFileType.MRC);
		volumeEM.positivize();
		volumeEM.applyContourAndNormalize(0.0176, 757);
		volumeEM.updateCenter();
		volumeEM.setRadiusVarMult(1.64);

There are some constants you set. Can you comment on setRadiusVarMult?

thorstenwagner avatar Oct 23 '23 11:10 thorstenwagner

The descriptors are based on volumes (maps), so yeah you could use maps directly. There are a few things to keep in mind, resulting from the fact that my main workflow was coordinates->simulated density->descriptors

  • The VolumeIO functionality was added for debugging purposes and doesn't support the map formats fully (e.g., it reads maps of any dimensions, but writes only cubes starting from (0,0,0), etc).
  • In simulated density I can control the density values, but in experimental maps they are all over the place. The density values in the map affect the resulting descriptors. That's why applyContourAndNormalize is used in this example - the values less than 0.0176 are zero'ed and the density is multiplied by 757. It results in the EM density distribution +/- similar to the simulated one.
    • I guess it would be less of an issue if you compare two maps that have consistent density values between each other. Otherwise you would need some normalisation to e.g. standard deviations over the mean.
  • The descriptors are calculated on a density inside a ball. We need to somehow define the radius of that ball. If it's too small, parts of the structure are excluded, if it's too large - we are describing a lot of empty space around the object. If it's inconsistent between similar structures - we are losing accuracy.
    • The Volume class calculates two possible radii: maximum (radiusMax) and radius of gyration (radiusVar). We found that for the (simulated) coordinate structures 1.8 of the radius of gyration seems to work best.
    • That's why DEFAULT_RADIUS_MULTIPLIER is set to 1.8, but can be overridden with setRadiusVarMult.
    • The EM volumes, naturally, are not as clean as the simulated ones. So the radiusVar tends to be a bit larger, since there's less empty space. That's why in the example it was lowered to match the radius from the coordinates.

If the maps you are comparing are coming from the PDB coordinates, as you mentioned, then I guess you generated them in a similar manner (+/- how the Volume::create(..) does it), then you don't need to worry about different normalisations and radii corrections. But for arbitrary EM maps the consistency of the densities needs to be validated first before calculating/comparing descriptors.

biocryst avatar Oct 23 '23 12:10 biocryst