MDPOW icon indicating copy to clipboard operation
MDPOW copied to clipboard

Automated Solvation Shell Analysis

Open cadeduckworth opened this issue 2 years ago • 14 comments

cadeduckworth avatar Jan 08 '23 23:01 cadeduckworth

Codecov Report

Merging #227 (1f62fa6) into develop (cafb300) will decrease coverage by 1.45%. The diff coverage is 0.00%.

:exclamation: Current head 1f62fa6 differs from pull request most recent head 61f53b9. Consider uploading reports for the commit 61f53b9 to get more accurate results

@@             Coverage Diff             @@
##           develop     #227      +/-   ##
===========================================
- Coverage    80.15%   78.70%   -1.45%     
===========================================
  Files           15       16       +1     
  Lines         1900     1935      +35     
  Branches       291      295       +4     
===========================================
  Hits          1523     1523              
- Misses         286      321      +35     
  Partials        91       91              
Impacted Files Coverage Δ
mdpow/workflows/solvations.py 0.00% <0.00%> (ø)

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov[bot] avatar Jan 08 '23 23:01 codecov[bot]

Lots of conflicts to resolve, @cadeduckworth !

orbeckst avatar Apr 04 '23 02:04 orbeckst

@orbeckst The following is a brief overview of the new manual EnsembleAnalysis workflow based on the changes I made to the original SolvationAnalysis: (the changes to mdpow.analysis.solvation are in the most recent commit`)

I made changes that allow a user to input a max distance, and obtain all of the solute-solvent interactions within that distance (distance and atom/atomgroup info included), per molecule, per solvent, per interaction, per lambda.

This is a very rough draft, especially for the plot, but the workflow, results DataFrame, and FacetGrid get the main idea across for what I am going for.

SAMPL9 Data, FF = GAFF, Molecule = S08

Workflow: Screen Shot 2023-04-05 at 12 37 38 PM

Results DF: (it might be better to break the *_ix lists up into more columns) Screen Shot 2023-04-05 at 12 37 55 PM

Figure (FacetGrid): (please excuse the quality, I am trying to work out the kinks in the .map() function) Screen Shot 2023-04-05 at 12 30 41 PM

cadeduckworth avatar Apr 05 '23 17:04 cadeduckworth

The manual workflow as shown is a good mini tutorial and looks already very good.

Presenting the data will be more of a challenge—the distance KDEs are naturally cut off at the cutoff, which makes it look a bit weird/incomplete. And once the solute disappears, the N becomes quite big and swamps the small numbers. Perhaps try a log plot?? Or normalize??? Also, are the large numbers compatible with what you get for a sphere of radius with the cutoff at the density of water or octanol/other solvent?

orbeckst avatar Apr 05 '23 19:04 orbeckst

The distance histograms are interesting. You can see how molecules start to overlap with the disappearing solute.

orbeckst avatar Apr 05 '23 20:04 orbeckst

@orbeckst Although there is a large step size specified, from a theoretical perspective I interpreted the violins as a stepwise visual contribution to confirmation that things are going according to plan in the context of turning off intermolecular interactions. I might be thinking about it wrong, but the solute or solvent is essentially being electrostatically and otherwise alchemically removed from the system which is positively indicated by the increased equilibration of solvent molecules, right?

On that note, I think seeing that there is not much change when turning of Coulombic interactions for neutral solute/solvent is a good indication, and then seeing the gradual increase of number of solvent molecules at a decreased distance when turning off VDW in that sense is also a good indication of sufficient sampling, IFF, this holds up for other cases with more frames included.

Perhaps I should run a simulation with one of the charged molecules that was excluded for logP calculations from one of the SAMPL sets and see what happens if I can get GROMACS to comply?

EDIT: paragraph spacing, wording

After thought: a nearest neighbors calculation or comparison of individual lambdas to a population distribution could help later in identifying when some systems are not behaving?

cadeduckworth avatar Apr 06 '23 03:04 cadeduckworth

@orbeckst

I will look into approximately homogenous equilibration of solvents to try to confirm the data for more completely sampled datasets when I get the chance (those lambdas for VDW closer to 1, when the solute is essentially removed). I apologize if I did not use the correct terms there, it has been a while since I covered equilibrium chemistry, but I think I understand what you are getting at.

cadeduckworth avatar Apr 06 '23 03:04 cadeduckworth

@orbeckst

I ran a more complete analysis of the same SAMPL9 GAFF molecule (S08) with a step size of 500 and cleaned up the plots a bit. This is the product of what I have been playing with using cairosvg and svgutils instead of sns.FacetGrid, which, outside of .catplots, when initialized directly, pretty much requires that axes be shared for the same data being analyzed, which is exactly the same as sharing axes in .catplots.

So, this is the SVG converted to pdf. The width ratios are slightly off, which affects the placement of the Mol object, but I cleaned up the scale on the y-axes as well as the height/width aspect ratio. Fixing the axes/scales and including much more data seems to give a better look at what is going on.

If we do eventually go with a stacked plot scheme, then I will reserve less of the right side and reduce it to a single Mol object and legend to make it look better.

S08.pdf

cadeduckworth avatar Apr 06 '23 04:04 cadeduckworth

From earlier this morning before making changes to SolvationAnalysis:

Using the original SolvationAnalysis class, which summed the number of solvent molecules within a specified first and second cutoff distance, separately, I made a time series of the number of solvent molecules. The concept does not make much sense since each lambda simulation starts from the same relaxed equilibrated MD sim(?), but if I can map the time series to the lambda values then it might be useful as an option later on for something. Note: this is using very under sampled data, large step size, but also shows 95% CI

Screen Shot 2023-04-06 at 12 00 55 AM

cadeduckworth avatar Apr 06 '23 05:04 cadeduckworth

For the S08 data, the number of solvent atoms is initially 60,000 — there aren't that many waters in the box, are they?

Can you do the following calculation: Given the standard density of water at the simulation conditions, how many water molecules would you expect to find in a volume roughly equivalent to (1) the solute, (2) the solute with its surface blown up by 4 Å? A rough physicist estimate will suffice as a start, e.g., approximating the molecule as a cylinder or sphere...

Then compare the estimate to the numbers from solvation analysis.

orbeckst avatar Apr 07 '23 01:04 orbeckst

I am not sure if the time series contain more interesting information than the averages or distributions — at least in equilibrium. Time series in equilibrium are good as a diagnostic to check if the system is likely sampling equilibrium: if the fluctuations are around a stable mean then it might be equilibrium...

orbeckst avatar Apr 07 '23 01:04 orbeckst

I don't think that we need to add the molecular structure next to the solvation plots unless you want to highlight specific atoms. Otherwise I'd expect a molecular structure to be included in an overview panel and the reader can use the ID to cross-reference.

orbeckst avatar Apr 07 '23 01:04 orbeckst

@orbeckst I am working on some calculations/estimates, but for clarification, the 'counts' bar plot is counting each solvent atom that is within the cutoff distance near any solute atom non-exclusively, so multiple solvent atom - solute atoms distances are being recorded and counted, meaning that solvent atom A near solute atom B and solute atom C are counted separately. I believe this is why the numbers are so high. There could be an easy way to count each solvent atom only once with a minor adjustment to the script. Additionally, we could still record all of the data, and only count each solvent molecule once in the plot by only counting the unique resids.

cadeduckworth avatar Apr 08 '23 15:04 cadeduckworth

Each atom should only be counted once. There's a single closest interaction between a solvent atom and the solute (barring coincidental equidistant triangles).

orbeckst avatar Apr 08 '23 18:04 orbeckst

@cadeduckworth is this something you'd want to complete for a MDPOW paper?

Can you resolve the conflicts so that one can look at the changes in the context of the latest version of the code?

orbeckst avatar Oct 14 '24 20:10 orbeckst