biozernike
Use for small molecules and Volume reconstruction
Hi! Thanks for the awesome library! Crunching all those numbers and functions from the paper and putting this together is amazing.
I am trying to adapt the library to use it for small-molecule 3D interaction potential comparison instead of protein densities for my research. These are energetic 3D maps with positive and negative real values.
I am coming across several challenges in understanding all the code and the science behind it, and I would like to kindly ask for some guidance :)
- What are the main things I should take care of in the code to change the type of volume ?
- I saw the similarity function was somehow “trained” but could not find any details.
- How does the normalisation process work? Is there something I would need to adjust here too?
Moreover, I was wondering if you have any example script showing how to reconstruct the volume from the moments, as you showcase in figure 1 of your paper for protein 4HHB? This would greatly help me evaluate whether the changes I introduce still make sense and produce valid moments.
thanks a lot and Merry Christmas! Daniel
Hi Daniel,
Thank you for your interest!
Sorry for the late response, these are busy days for me. To address some of your questions I committed (very dirty) experimental code that handles reconstruction (of ligands, no less). Please check out the branch reconstruct_tmp, and the example in VolumeTest.testReconstruction().
You will need to save the cache file once (takes some time), and then it can be used to speed things up. The cache location is hardcoded in the ReconstructionCache class. The file will load rather slowly still (but only once per run), I had issues with faster serialization libraries so left it as is for now.
In theory Zernike moments will approximate any function, with both positive and negative values. But in this code we're kind of assuming that only positive values are meaningful, so the volume normalisation will break.
Pay attention to the Volume class: the center, the radius of gyration and the volume mass are the things that will definitely be affected. The first two need to be calculated on the absolute values of the density; for the mass I'm not sure - try converting/reconstructing and see what you get.
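A minimal sketch of what "calculated on the absolute values" could look like. This is not the library's Volume code, just an illustration on a flattened grid with unit voxels and no origin offset:

```java
// Sketch (hypothetical helper, not part of biozernike): center and radius of
// gyration computed on |density|, so grids with negative values (e.g. energy
// maps) still get a sensible geometric center.
public class AbsDensityStats {
    // Returns {cx, cy, cz, radiusOfGyration} for a flattened dimX*dimY*dimZ grid.
    public static double[] stats(double[] voxels, int dimX, int dimY, int dimZ) {
        double mass = 0, cx = 0, cy = 0, cz = 0;
        for (int z = 0; z < dimZ; z++)
            for (int y = 0; y < dimY; y++)
                for (int x = 0; x < dimX; x++) {
                    double v = Math.abs(voxels[(z * dimY + y) * dimX + x]);
                    mass += v; cx += v * x; cy += v * y; cz += v * z;
                }
        cx /= mass; cy /= mass; cz /= mass;
        double rg2 = 0;
        for (int z = 0; z < dimZ; z++)
            for (int y = 0; y < dimY; y++)
                for (int x = 0; x < dimX; x++) {
                    double v = Math.abs(voxels[(z * dimY + y) * dimX + x]);
                    rg2 += v * ((x - cx) * (x - cx) + (y - cy) * (y - cy) + (z - cz) * (z - cz));
                }
        return new double[]{cx, cy, cz, Math.sqrt(rg2 / mass)};
    }
}
```

Without the absolute value, a map whose positive and negative lobes cancel would get a near-zero mass and a meaningless center.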
The later normalisation (as in, Zernike moment alignment by zero'ing some moment values) should work as usual.
I hope this helps.
Happy new year!
Kind regards, Dmytro.
Wow! This really helped a lot! I could expand and reconstruct the energy grids I am dealing with and they make sense. I could also superpose some molecules using the invariants! I am very grateful!
I still need to figure out how to get reconstructed values that are on the same scale as the input, but for my immediate purpose I could get this working quite well! Amazing!
Thanks a lot Dmytro.
Hi again,
it's me again, now trying to understand the rotation part of the code... I have problems understanding how the superposition works... Could I get some help understanding the chunk of code below, from the VolumeTest.testVectors test, for instance? Where do the parameters 2, 2, and indsPositive come from, and what do they mean? In general it would be really helpful if you could comment on what's going on in that test.
Once superposed, how do I extract the rotated moments? I would like to compare distances taking all the coefficients (and not just the invariants) once rotated. Is that possible?
Thanks for the invaluable help. Regards,
Daniel
https://github.com/biocryst/biozernike/blob/6dfb2f9fefc0e87909c0ebbbc1b17e7b391fba71/src/test/java/org/rcsb/biozernike/VolumeTest.java#L353-L360
Hi Daniel.
Sorry for the wall of text. You did ask, so...
First of all, there's no true superposition here. We were exploring ways to calculate descriptors based on Zernike moments that preserve all information about the structure and do not depend on the object orientation. To achieve this we 'normalise' the orientation. It means that we want to transform the moments consistently with rotation (so that the object remains the same) and find a transformation that is unique for any object (so that we get it whatever its starting orientation was). This is what the class InvariantNorm is doing.
InvariantNorm ligandNormalization = new InvariantNorm(ligandVolume, 15);
Now, to the uniqueness of the transformation. Let's say I have two identical structures that are oriented differently. I calculate their moments and it turns out that moment M222 = 5 in the first one and M222 = -2 in the second one. I know that there exists an orientation where these two are identical, and 0 is always a valid moment value, so I make an equation like:
Rotation x M222 = 0
Three equations like this almost solve the uniqueness problem (because a rotation has 3 degrees of freedom). There are many ways to pick/solve these equations; I derived one that's more or less flexible and consistent. It's implemented in InvariantNorm.computeRotations(int indZero, int indReal).
The parameter indZero defines the order from which a moment is picked that is completely zeroed (it is a complex number, so both real and imaginary parts = 0). indReal defines the order where a moment is picked whose imaginary part is set to 0 (so it becomes a real number). That makes 3 constrained values in total.
If both parameters are 2, it means we are taking the equivalent ellipsoid of the structure (that is defined by the moments of order 2) and orient it so that its axes are in a 'standard' position. We don't know what the standard position is beforehand - it is defined indirectly through the conditions above. (2,2) is generally the safest and most interpretable normalisation, but trying other orders may lead to more accurate results for your structures.
Now, I said above that the uniqueness problem is 'almost' solved. Take an ellipsoid, flip it 180 degrees around any axis and you get an identical ellipsoid. That's why the following line of code returns not one, but a list of 8 transformations (for all flips combined):
List<MomentTransform> ligandTransforms = ligandNormalization.getNormalizationSolution(2,2);
InvariantNorm.getConstrainedNormalizationSolution() addresses the problem of "which one out of the 8 do I pick" as follows (it's not an 'official' function though, just a helper I used at some point).
Luckily, there are moments that flip their sign for every ellipsoid axis flip. You can use them like a binary code: (-,-,-), (-,-,+), (-,+,-) etc. and you can say that from the 8 solutions you want the (+,+,+) one. The parameter indsPositive says which indices of the moments in the flattened moment list will flip, and only one transformation will be returned where all of these are positive.
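The selection idea can be sketched like this (a hypothetical standalone helper, not the library's actual getConstrainedNormalizationSolution(); it works on flattened moment vectors directly):

```java
import java.util.List;

// Sketch of the "binary code" flip selection: out of the 8 candidate solutions,
// keep the one whose sign-flipping moments are all positive at the given indices.
public class FlipSelector {
    public static double[] selectAllPositive(List<double[]> candidateMoments, int[] indsPositive) {
        for (double[] m : candidateMoments) {
            boolean allPositive = true;
            for (int ind : indsPositive)
                if (m[ind] <= 0) { allPositive = false; break; }
            if (allPositive) return m;
        }
        return null; // can happen for symmetric structures, where the scheme breaks down
    }
}
```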
BTW: the indices in the example are only valid for indReal = indZero = 2. I didn't derive a general rule for how to pick them; I just do it by looking at the data and validating.
BTW2: in the paper/RCSB search system we addressed "which one out of 8 do I pick" differently:
- The average of the 8 solutions was used for the search descriptors.
- All 8 solutions are used for the alignment descriptors. Then at runtime we pick the pair of solutions that leads to the best superposition. The reason was that symmetric structures/assemblies screw up the 'binary encoding' scheme with moment signs in a multitude of ways, depending on the symmetry group. Fortunately for you, small molecules do not seem to be too symmetric. At least in my tests the (+++) solution was always unique.
Finally, let's clarify what superposition here means exactly.
Suppose we have two structures S1 and S2, we calculate their moments and normalise the moments so that the criteria above are satisfied (importantly: the same set of criteria should be used for both structures!). Next:
- If you reconstruct from the normalised moments you'll get the structures that are superposed. But not in the original orientation of either of the structures.
- Normalisation is equivalent to some rotation (MomentTransform.rotation() calculates the matrix), so the reconstructed structures will be rotated: R1S1, R2S2.
- You can use the matrix R1^(-1)*R2 to align S2 to S1, or vice versa.
Importantly, this superposition is a side-effect of the normalisation procedure, we didn't optimise it. In most cases it can be improved by local search using whatever quality metric you'd like.
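The R1^(-1)*R2 step above can be sketched with plain 3x3 row-major matrices (the library itself returns its own rotation type from MomentTransform.rotation(); this is just the arithmetic). Since R1 is a rotation matrix, its inverse is its transpose:

```java
// Sketch: alignment matrix that maps S2 onto S1, given the two normalisation
// rotations R1 and R2. Computes transpose(r1) * r2, i.e. R1^(-1) * R2.
public class AlignRotations {
    public static double[][] alignment(double[][] r1, double[][] r2) {
        double[][] out = new double[3][3];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                for (int k = 0; k < 3; k++)
                    out[i][j] += r1[k][i] * r2[k][j]; // note r1 indices transposed
        return out;
    }
}
```

As a sanity check, aligning a structure to itself (R1 = R2) should give the identity matrix.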
To extract the rotated moments use MomentTransform.getMoments() (structured) or MomentTransform.getFlatMoments() (flattened list).
MomentTransform.distanceTo(MomentTransform other) compares two solutions, but it implements only simple Canberra distance.
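For reference, the Canberra distance over two flattened moment vectors is simple enough to roll yourself if you want to swap in another metric later (this is a generic sketch, not the library's internal implementation):

```java
// Sketch: Canberra distance between two equal-length descriptor vectors,
// sum over i of |a[i] - b[i]| / (|a[i]| + |b[i]|), skipping zero denominators.
public class MomentDistance {
    public static double canberra(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
            double denom = Math.abs(a[i]) + Math.abs(b[i]);
            if (denom > 0) d += Math.abs(a[i] - b[i]) / denom;
        }
        return d;
    }
}
```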
I hope this helps.
Kind regards, Dmytro.
This is super useful! It helped me understand so many things in the paper and the code! I appreciate your time; it's very kind of you.
I have one "maybe last :)" question about searching with the descriptors. I would like to implement something similar, where I would have a database with ligands and descriptors to search against. Is the descriptor a plain concatenation of all the CN 2nd, 3rd, 4th and 5th order invariants (+ the GEO part)? And does the weighted distance metric you parameterized have a weight for each coefficient?
Indeed. You calculate the moments up to a certain order (it doesn't need to be 20; we just used that number to compare to the state of the art), concatenate the averaged normalisations for a few orders (InvariantNorm.getInvariants() does that) and the geometric descriptor (Descriptor.calcGeometryDescriptor()).
You may need to simplify the geometry descriptor for small molecules. For example, the kurtosis of a distance distribution with 10 points wouldn't make much sense. Maybe the radius and the nominal weight are all you need.
The geometric parameters may be used as a filter (e.g., if any of them differ by more than 20%, then it's not a hit), and then you compare the moments of the structures that have passed the filter. For the first iteration of your implementation you don't really need to train weights for comparing moments. A regular distance like Bray-Curtis or Canberra usually works well enough.
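The pre-filter step could look like this (the class name and the 20% threshold are illustrative, not from the library; survivors would then be ranked by your chosen moment distance):

```java
// Sketch: reject a database entry early if any geometric parameter differs
// from the query by more than maxRelDiff (relative to the larger magnitude).
public class GeomFilter {
    public static boolean passes(double[] geomQuery, double[] geomEntry, double maxRelDiff) {
        for (int i = 0; i < geomQuery.length; i++) {
            double denom = Math.max(Math.abs(geomQuery[i]), Math.abs(geomEntry[i]));
            if (denom > 0 && Math.abs(geomQuery[i] - geomEntry[i]) / denom > maxRelDiff)
                return false;
        }
        return true;
    }
}
```

The point of the filter is purely speed: it is a handful of comparisons per entry, so only a small fraction of the database needs the full moment-distance computation.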
The weights were needed because 1) the geometric descriptor and the moment descriptor are on different scales, and we wanted one distance measure that uses them all, and 2) some moments are more important than others (e.g. the same difference at order 2 and at order 20 means vastly different things). Training also helps to reduce the final vector quite a bit (if you constrain the weights to >= 0, many of them will be 0), but that will depend on the dataset and the training method.