solvation-analysis
Should we allow custom functions for calculating solvation shells at the RDF step?
I am thinking that we might want to add more flexibility at the level above analysing the RDF itself, i.e. should we allow custom functions to calculate shells that don't rely on the RDF?
This would allow generalisation of the problem to any classifier, which I hope would find uses in analysing more complex solvation behaviour. More complex functions like NNs may be able to identify second solvation shells and correlated motions better than a simple 1D function (the RDF).
Our current workflow goes:
trajectory -> RDF() -> Radii() -> Solvation_data
Really the RDF and Radii steps can be merged to make it look like:
trajectory -> some_classifier() -> Solvation_data
We currently allow flexibility at the Radii step, but not at the RDF step, i.e. the user cannot provide a non-RDF-based method to then determine radii.
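For context, the existing Radii step is essentially just a function of the RDF. Something like the toy version below captures the idea of a pluggable radii-finder (purely illustrative; the function name and signature are made up here, not the package's actual API):

```python
import numpy as np

def first_minimum_radius(bins, rdf):
    """Toy 'Radii' step: take the first local minimum of the RDF as the shell cutoff."""
    # interior points lower than both neighbours; real RDFs are noisy, which is
    # why the current kernel fits a spline before looking for the minimum
    minima = np.where((rdf[1:-1] < rdf[:-2]) & (rdf[1:-1] < rdf[2:]))[0] + 1
    return bins[minima[0]]
```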
Do we think this would be a good addition? It is a bit complex and would require a bit of a refactor, so I would be happy to push it down the track.
I am thinking that perhaps we turn the middle step into a class that must be subclassed so that it can be easily switched out, i.e. the user would provide the implementation they wanted. The default setup we use at the moment would then look like:
```python
class RDF_Univariate_Spline_Kernel(SolvationKernelBase):
    def compute(self):
        # fit a univariate spline to the RDF and take its first minimum as the shell radius
        ...
        return radii


class Solution:
    def __init__(self, solvation_kernel=RDF_Univariate_Spline_Kernel):
        self.radii = solvation_kernel().compute()
        ...
```
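As a usage sketch (hypothetical; FixedCutoffKernel doesn't exist anywhere, it's only here to show how a non-RDF kernel would slot in), swapping kernels then becomes a constructor argument:

```python
class FixedCutoffKernel(SolvationKernelBase):
    """Hypothetical kernel: skip the RDF entirely and use one hard-coded cutoff."""

    cutoff = 3.0  # angstroms

    def compute(self):
        # no RDF or spline fitting; every solvent type gets the same radius
        return self.cutoff


solution = Solution(solvation_kernel=FixedCutoffKernel)
```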
I do like this idea. So to be clear, the solvation shell would still be determined by a distance cutoff; we would just be changing how that distance is calculated?
What do you imagine the input of a compute() call would be? The central solute Atom and perhaps the surrounding Solvents?
EDIT: never mind, I suppose the distance determination kernel would only need to be run once per solvent. My main concern would be speed. The inner loop is executed n_frames * n_solutes times, so it could end up being quite expensive. Right now we send this down to C primitives so it's not an issue. I could imagine a poorly optimized kernel eating up a ton of time, but I suppose that's an issue for the user.
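For reference, the sort of C-level primitive in play here is MDAnalysis's pair search; a hedged sketch of the pattern (not claiming this is the exact internal call path):

```python
from MDAnalysis.lib import distances

def shell_members(solute_atoms, solvent_atoms, cutoff):
    # capped_distance runs the pair search in C/Cython, which is what keeps the
    # n_frames * n_solutes inner loop cheap; a pure-Python kernel redoing this
    # search per solute could easily dominate the runtime
    pairs = distances.capped_distance(
        solute_atoms.positions,
        solvent_atoms.positions,
        max_cutoff=cutoff,
        box=solute_atoms.dimensions,
        return_distances=False,
    )
    return pairs  # (solute_index, solvent_index) pairs within the cutoff
```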
Before we implement this, I think we should have a really solid RDF-based kernel. The kernel I implemented is mediocre. The kernels you mentioned in issue #54 may solve this if they work well!
Sorry, I should have been clearer. I'm envisioning any arbitrary classifier function, but I imagine distance-based ones will dominate.
This just allows you to add more complexity if required. Say, for example, you want to require a certain distance and some other constraint.
Yeah it just needs to calculate the data that goes into the final dataframe by some means.
Performance-wise it totally depends on what your classifier kernel is and how it's implemented, so it's really a user problem. Could be super fast (think JAX)!
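To illustrate that point (a toy sketch only, with no periodic-boundary handling and nothing to do with the current implementation), a vectorised, JIT-compiled cutoff kernel could stay very cheap even in pure Python:

```python
import jax
import jax.numpy as jnp

@jax.jit
def in_shell(solute_pos, solvent_pos, cutoff):
    # boolean mask over solvent atoms within `cutoff` of one solute atom
    diff = solvent_pos - solute_pos            # (n_solvent, 3)
    dist = jnp.sqrt(jnp.sum(diff * diff, axis=-1))
    return dist < cutoff

# vectorise over every solute in a frame in one call
in_shell_all = jax.vmap(in_shell, in_axes=(0, None, None))
```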
This is complex, so I'm not in a rush, but I just thought it could be a fun idea.
WRT #54, I have dropped the PyTorch kernel for now, but have a play with the notebook once it's merged and see the kinds of results that find_peaks can provide. It seemed to work well for me. We can always add it as a formal kernel if you like it.
So would the input to an arbitrary kernel be the whole universe? Or would it be a large set of molecules surrounding each solute?
I'm wondering if we'd want to have the kernel loop through all of the solutes one by one, or pass it the universe and have it return the solvation shells for each solute.
I think this would be an awesome feature for v0.3. I imagine that incorporating things like angle constraints could be really helpful, especially for identifying hydrogen bonds.
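For example (completely illustrative, using MDAnalysis's distance/angle helpers rather than anything in solvation-analysis today, and assuming one hydrogen per donor indexed in the same order), a kernel combining a distance cutoff with an angle cutoff might look like:

```python
import numpy as np
from MDAnalysis.lib.distances import capped_distance, calc_angles

def hbond_like_shell(donors, hydrogens, acceptors, box, d_cut=3.5, angle_cut=150.0):
    """Return (donor, acceptor) index pairs passing both a distance AND an angle test."""
    # donor-acceptor pairs within the distance cutoff (C-accelerated pair search)
    pairs = capped_distance(donors.positions, acceptors.positions,
                            max_cutoff=d_cut, box=box, return_distances=False)
    d_idx, a_idx = pairs[:, 0], pairs[:, 1]
    # donor-hydrogen-acceptor angle, apex at the hydrogen, returned in radians
    angles = calc_angles(donors.positions[d_idx],
                         hydrogens.positions[d_idx],
                         acceptors.positions[a_idx],
                         box=box)
    return pairs[np.degrees(angles) >= angle_cut]
```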
At its core I think only the universe and one selection string need to be provided. This is the setup we use now, but an arbitrary kernel could of course take more arguments if it wanted.
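So the base class might end up with a shape like this (just a sketch of the interface being discussed, not a committed API; note it takes the universe and selection explicitly, unlike the zero-argument compute() in the earlier sketch):

```python
from abc import ABC, abstractmethod

class SolvationKernelBase(ABC):
    """Minimal interface: a universe plus one solvent selection string."""

    @abstractmethod
    def compute(self, universe, solvent_selection, **kwargs):
        """Return whatever per-solvent data is needed to build the final solvation dataframe."""
```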