MDPOW
MDPOW copied to clipboard
Automated Solvation Shell Analysis
Codecov Report
Merging #227 (1f62fa6) into develop (cafb300) will decrease coverage by
1.45%
. The diff coverage is0.00%
.
:exclamation: Current head 1f62fa6 differs from pull request most recent head 61f53b9. Consider uploading reports for the commit 61f53b9 to get more accurate results
@@ Coverage Diff @@
## develop #227 +/- ##
===========================================
- Coverage 80.15% 78.70% -1.45%
===========================================
Files 15 16 +1
Lines 1900 1935 +35
Branches 291 295 +4
===========================================
Hits 1523 1523
- Misses 286 321 +35
Partials 91 91
Impacted Files | Coverage Δ | |
---|---|---|
mdpow/workflows/solvations.py | 0.00% <0.00%> (ø) |
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
Lots of conflicts to resolve, @cadeduckworth !
@orbeckst
The following is a brief overview of the new manual EnsembleAnalysis
workflow based on the changes I made to the original SolvationAnalysis
: (the changes to mdpow.analysis.solvation
are in the most recent commit`)
I made changes that allow a user to input a max distance, and obtain all of the solute-solvent interactions within that distance (distance and atom/atomgroup info included), per molecule, per solvent, per interaction, per lambda.
This is a very rough draft, especially for the plot, but the workflow, results DataFrame, and FacetGrid get the main idea across for what I am going for.
SAMPL9 Data, FF = GAFF, Molecule = S08
Workflow:
Results DF: (it might be better to break the *_ix
lists up into more columns)
Figure (FacetGrid): (please excuse the quality, I am trying to work out the kinks in the .map()
function)
The manual workflow as shown is a good mini tutorial and looks already very good.
Presenting the data will be more of a challenge—the distance KDEs are naturally cut off at the cutoff, which makes it look a bit weird/incomplete. And once the solute disappears, the N becomes quite big and swamps the small numbers. Perhaps try a log plot?? Or normalize??? Also, are the large numbers compatible with what you get for a sphere of radius with the cutoff at the density of water or octanol/other solvent?
The distance histograms are interesting. You can see how molecules start to overlap with the disappearing solute.
@orbeckst Although there is a large step size specified, from a theoretical perspective I interpreted the violins as a stepwise visual contribution to confirmation that things are going according to plan in the context of turning off intermolecular interactions. I might be thinking about it wrong, but the solute or solvent is essentially being electrostatically and otherwise alchemically removed from the system which is positively indicated by the increased equilibration of solvent molecules, right?
On that note, I think seeing that there is not much change when turning of Coulombic interactions for neutral solute/solvent is a good indication, and then seeing the gradual increase of number of solvent molecules at a decreased distance when turning off VDW in that sense is also a good indication of sufficient sampling, IFF, this holds up for other cases with more frames included.
Perhaps I should run a simulation with one of the charged molecules that was excluded for logP calculations from one of the SAMPL sets and see what happens if I can get GROMACS to comply?
EDIT: paragraph spacing, wording
After thought: a nearest neighbors calculation or comparison of individual lambdas to a population distribution could help later in identifying when some systems are not behaving?
@orbeckst
I will look into approximately homogenous equilibration of solvents to try to confirm the data for more completely sampled datasets when I get the chance (those lambdas for VDW closer to 1, when the solute is essentially removed). I apologize if I did not use the correct terms there, it has been a while since I covered equilibrium chemistry, but I think I understand what you are getting at.
@orbeckst
I ran a more complete analysis of the same SAMPL9 GAFF molecule (S08) with a step size of 500 and cleaned up the plots a bit.
This is the product of what I have been playing with using cairosvg
and svgutils
instead of sns.FacetGrid
, which, outside of .catplots
, when initialized directly, pretty much requires that axes be shared for the same data being analyzed, which is exactly the same as sharing axes in .catplots
.
So, this is the SVG converted to pdf. The width ratios are slightly off, which affects the placement of the Mol object, but I cleaned up the scale on the y-axes as well as the height/width aspect ratio. Fixing the axes/scales and including much more data seems to give a better look at what is going on.
If we do eventually go with a stacked plot scheme, then I will reserve less of the right side and reduce it to a single Mol object and legend to make it look better.
From earlier this morning before making changes to SolvationAnalysis
:
Using the original SolvationAnalysis
class, which summed the number of solvent molecules within a specified first and second cutoff distance, separately, I made a time series of the number of solvent molecules.
The concept does not make much sense since each lambda simulation starts from the same relaxed equilibrated MD sim(?), but if I can map the time series to the lambda values then it might be useful as an option later on for something. Note: this is using very under sampled data, large step size, but also shows 95% CI
For the S08 data, the number of solvent atoms is initially 60,000 — there aren't that many waters in the box, are they?
Can you do the following calculation: Given the standard density of water at the simulation conditions, how many water molecules would you expect to find in a volume roughly equivalent to (1) the solute, (2) the solute with its surface blown up by 4 Å? A rough physicist estimate will suffice as a start, e.g., approximating the molecule as a cylinder or sphere...
Then compare the estimate to the numbers from solvation analysis.
I am not sure if the time series contain more interesting information than the averages or distributions — at least in equilibrium. Time series in equilibrium are good as a diagnostic to check if the system is likely sampling equilibrium: if the fluctuations are around a stable mean then it might be equilibrium...
I don't think that we need to add the molecular structure next to the solvation plots unless you want to highlight specific atoms. Otherwise I'd expect a molecular structure to be included in an overview panel and the reader can use the ID to cross-reference.
@orbeckst
I am working on some calculations/estimates, but for clarification, the 'counts' bar plot is counting each solvent atom that is within the cutoff distance near any solute atom non-exclusively, so multiple solvent atom - solute atoms distances are being recorded and counted, meaning that solvent atom A near solute atom B and solute atom C are counted separately. I believe this is why the numbers are so high. There could be an easy way to count each solvent atom only once with a minor adjustment to the script. Additionally, we could still record all of the data, and only count each solvent molecule once in the plot by only counting the unique resid
s.
Each atom should only be counted once. There's a single closest interaction between a solvent atom and the solute (barring coincidental equidistant triangles).
@cadeduckworth is this something you'd want to complete for a MDPOW paper?
Can you resolve the conflicts so that one can look at the changes in the context of the latest version of the code?