
Initial Attempt at PR for issue #3603

EchoFia opened this pull request 8 months ago • 13 comments

Would greatly appreciate some feedback, as this is just meant to be a rough version (and my first time doing a PR, so hopefully it's in the right place). What's the best way to discuss changes and get feedback?

It's very rough at the moment, because I have a lot of questions that I think would be good to clear up (I'll try to put another message in the thread of issue #3603 summarising them). I'm not expecting this to be merged immediately, so I'll properly fill this out later (i.e. the checkboxes and the certification); my eyes just need a break from reading through things.

Fixes #3603

Changes made in this Pull Request:

  • Changed test_lineardensity.py from using a system of 5 water molecules to 3 systems (neutral particles, charged particles and charged dimers) that better demonstrate each aspect of Linear Density analysis. The systems consist of 100 atoms/dimers of mass 1 in a cubic box of side length 1 Angstrom.
  • More detail will be in the comments of issue #3603 in an hour or 2, hopefully

PR Checklist

  • [x] Issue raised/referenced?
  • [x] Tests updated/added?
  • [ ] Documentation updated/added?
  • [x] package/CHANGELOG file updated?
  • [ ] Is your name in package/AUTHORS? (If it is not, add it!)

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.


📚 Documentation preview 📚: https://mdanalysis--5013.org.readthedocs.build/en/5013/

EchoFia avatar Apr 05 '25 21:04 EchoFia

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 93.58%. Comparing base (1cdb055) to head (b541d3f). :warning: Report is 66 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #5013      +/-   ##
===========================================
- Coverage    93.61%   93.58%   -0.03%     
===========================================
  Files          177      189      +12     
  Lines        21894    22960    +1066     
  Branches      3095     3095              
===========================================
+ Hits         20495    21488     +993     
- Misses         946     1019      +73     
  Partials       453      453              

:umbrella: View full report in Codecov by Sentry.

codecov[bot] avatar Apr 07 '25 16:04 codecov[bot]

I'm not exactly sure what is happening with errors, my code was committed fine, but then I tried to update the AUTHORS (which worked initially) but it failed like 4 times over. Any idea what is going wrong?

Also, thanks for all the feedback! I think I've fixed all of them, and have used black to format things nicer (there was a slight issue with the version but it seems it was fine).

EchoFia avatar Apr 09 '25 19:04 EchoFia

In principle there's no problem with tests checking that the code still produces the same answers. These are regression tests and they are valuable.

Having additional correctness tests, which check that a known correct value is obtained, is great, too. However, we don't want to write the same code for the calculation that we already have in MDAnalysis. Instead, you want to compare either against results from another trustworthy source or against analytical results. In this case, I'd suggest generating systems that should be homogeneous and then testing that the computed densities are close to constant. You could generate a system on the fly (following https://github.com/MDAnalysis/mdanalysis/issues/3603#issuecomment-1090471051 ) at the density of water with

import numpy as np
import MDAnalysis as mda

N = 1000        # number of particles
N_frames = 500  # number of trajectory frames

# bulk density of water, converted to Å**-3
water_density = mda.units.convert(1.0, "water", "A^{-3}")
V = N / water_density
L = V**(1/3)
print(f"cubic box with L={L} Å and N={N} particles")
box = np.array([L, L, L, 90, 90, 90])

# generate random uniformly distributed coordinates in the box
rng = np.random.default_rng()
positions = rng.uniform(low=[0, 0, 0], high=box[:3], size=(N_frames, N, 3))

# create a minimalist universe
u = mda.Universe.empty(n_atoms=N, trajectory=True)
# update the in-memory representation with our generated data
u.trajectory.set_array(positions, order="fac")
u.trajectory.dimensions_array[:] = box

# set masses, charges, etc. as needed
# ...

You now know the density of your system:

density = u.atoms.n_atoms / u.trajectory.ts.volume
print(f"density {density} Å**-3")

and because you uniformly sampled space, on average the density should be constant everywhere. Now you know that your calculated particle density (from LinearDensity) should be close to this number. Similarly, your mass density should be mass * density and your charge density q * density, where mass and q are the mass and charge of a particle. To make a neutral system with charges, I'd just give a charge of +q to half the particles and -q to the other half, and on average the charge should be 0.

You will not get the exact numbers because of randomness and incomplete sampling so you will have to play with the tolerances for your checks. Maybe they'll only be within a relative tolerance of 1e-3... but that's something to try.
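To make the statistical check concrete, here is a NumPy-only sketch of the kind of comparison described above. The charge q, the bin count, and the tolerances are illustrative choices, not values prescribed anywhere in this thread:

```python
import numpy as np

rng = np.random.default_rng(42)
N, N_frames, L = 1000, 500, 10.0
nbins = 20

# give +q to half the particles and -q to the other half (q is illustrative)
q = 0.5
charges = np.where(np.arange(N) < N // 2, q, -q)

# z coordinates sampled uniformly over all frames
z = rng.uniform(0.0, L, size=(N_frames, N))

# per-slab number density along z, averaged over frames
slab_volume = L * L * (L / nbins)   # cubic box, slabs perpendicular to z
counts, _ = np.histogram(z, bins=nbins, range=(0.0, L))
number_density = counts / (N_frames * slab_volume)

expected = N / L**3   # analytical uniform density
assert np.allclose(number_density, expected, rtol=0.05)

# the charge density of the neutral system should average to ~0 everywhere
qhist, _ = np.histogram(z, bins=nbins, range=(0.0, L),
                        weights=np.broadcast_to(charges, (N_frames, N)))
charge_density = qhist / (N_frames * slab_volume)
assert np.allclose(charge_density, 0.0, atol=0.05 * expected * q)
```

The tolerances here pass comfortably for these sample sizes; in a real test you would tighten or loosen them based on N and N_frames, exactly as described above.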

orbeckst avatar Apr 12 '25 00:04 orbeckst

Ah, okay, this makes sense. Thanks for clearing that up, I was questioning why I was rewriting the code, but I figured that it was just improving reproducibility.

I don't really understand why the system needs to be set at the density of water, as it seems like this just scales the system without changing any properties. Am I missing something here, or is it maybe just to standardise things?

I am also concerned with how the groupings work with this. I have been taking the groupings to be random choices of atoms, which means that their centre of mass is more likely to be in the middle than at the extremes, and as such, the distribution won't follow a uniform (nor are there enough atoms per residue/etc... for it to follow the central limit theorem, I fear). If it is a numbers game, I could simply increase the number of atoms, and then the groupings should more closely follow a normal distribution. I'll code it up and see what the required rtol would have to be, to see if it's reasonable.

EchoFia avatar Apr 12 '25 07:04 EchoFia

Also, can I ask for a little further clarification on why my code doesn't match the requirements? My understanding is that my code is done almost independently of the module (admittedly, I should have used the masses and charges straight from the variables 'masses' and 'charges' rather than u.atoms.masses). As such, the code is quite simple and calculates the totals and the densities using relatively simple equations.

My perspective is that the values calculated will always be correct by the definition of density, and that if MDAnalysis is ever erroneous and giving wrong values, it will fail this test, as the test is not dependent on the module. If there's a problem of lacking repeatability, I could document running millions of test cases against the current version of LinearDensity (which is assumed to give correct answers), or against different software like VMD, to ensure it always gives the correct answers. I can also add the checks above, such as homogeneity checks for 'groupings=atoms', as well as potentially other checks, such as using different distributions for coords and seeing if the densities are as expected, or cases where we expect zeros. Essentially, the aim would be to generate data that is reliably correct (as upon initial review, I don't immediately see any easy way to find lots of data that is reliable).

I'm coming from a science background, so this makes sense to me in terms of statistical repeatability, reproducibility and reliability, but I imagine the same concepts in computer science could be potentially slightly different in what they're looking for. Any ideas where the error in my thinking is, or am I just completely wrong here?

EchoFia avatar Apr 12 '25 07:04 EchoFia

Your code is too complicated for a simple test and thus will become a maintenance burden. Keep it as simple as possible.

It doesn’t have to be the density of water but you have to choose some value and the density of water is a typical value in soft matter.

orbeckst avatar Apr 12 '25 15:04 orbeckst

Hi, been away for a few days so just coming back to this now. I'm still a little confused by some parts, I think, so I'll try and reword my questions more concisely:

  1. From issue #3603, defining what 'independent of LinearDensity' means explicitly. Does this mean it's okay to use the functions used by LinearDensity (e.g. center_of_mass), or should it be from first principles? I ask because, if it's the former, the new test will just be using all the same functions and feels like it's just repeating the code, which seems redundant.

  2. Would you be able to expand on what aspects of my code are over complicated? To me, it appears that our codes are quite similar, except I've written it independent of the module (which can be changed, based on answer to Q1), and that I have written code for different groupings (which is the bulk of the complexity).

  3. Is it possible to get feedback on the problem I pointed out with groupings and how they will not be uniformly distributed due to the centres of mass being the sum of uniformly distributed variables, which will make it approach a normal distribution (CLT)?

  4. Is it possible to get feedback on the second comment of my previous 2 (one starting 'Also, can I ask...')?

EchoFia avatar Apr 16 '25 14:04 EchoFia

@PicoCentauri (EDIT: ~~as the author of~~) as one of the developers who worked on LinearDensity, do you have a moment to chime in here?

@BFedder as the author of the issue #3603 can you comment, please?

orbeckst avatar Apr 16 '25 15:04 orbeckst

Is it possible to get feedback on the second comment of my previous 2 (one starting 'Also, can I ask...')?

Sorry, which one do you mean? Please just include your questions or links.

We really only have minutes to look at PRs and if I need to scroll to find something, I've pretty much spent all my time that I had. Please make it easy for us to answer.

orbeckst avatar Apr 16 '25 15:04 orbeckst

My perspective is that the values calculated will always be correct by the definition of density, and that if MDAnalysis is ever erroneous and giving wrong values, it will fail this test, as the test is not dependent on the module. If there's a problem of lacking repeatability, I could document running millions of test cases against the current version of LinearDensity (which is assumed to give correct answers), or against different software like VMD, to ensure it always gives the correct answers. I can also add the checks above, such as homogeneity checks for 'groupings=atoms', as well as potentially other checks, such as using different distributions for coords and seeing if the densities are as expected, or cases where we expect zeros. Essentially, the aim would be to generate data that is reliably correct (as upon initial review, I don't immediately see any easy way to find lots of data that is reliable).

I'm coming from a science background, so this makes sense to me in terms of statistical repeatability, reproducibility and reliability, but I imagine the same concepts in computer science could be potentially slightly different in what they're looking for. Any ideas where the error in my thinking is, or am I just completely wrong here?

Sorry, I meant this one, on the concept of 'trustworthy data'. I'll summarise here:

  • No immediately obvious database of 'trustworthy data', so would it be possible to confirm my code's results by running plenty of checks so as to make my data 'trustworthy', or in other words, more reliable and repeatable?

EchoFia avatar Apr 16 '25 16:04 EchoFia

"Trustworthy" in the sense that you convinced yourself and others that these are the correct numbers. Sometimes this is generating the same output with different independent codes, sometimes that is comparing to an analytical solution, sometimes it's what you say, running something many times (but this will normally not reveal systematic errors in your code). Think of it as doing research for a paper (just on much smaller scale): You have to convince your readers that you're right.

orbeckst avatar Apr 16 '25 16:04 orbeckst

Hm, agree with @orbeckst that the code is currently too complicated and will lead to maintenance problems down the line. Would agree that it would be better for maintenance to do it as suggested with the density of water.

talagayev avatar Apr 22 '25 10:04 talagayev

I am also concerned with how the groupings work with this. I have been taking the groupings to be random choices of atoms, which means that their centre of mass is more likely to be in the middle than at the extremes, and as such, the distribution won't follow a uniform (nor are there enough atoms per residue/etc... for it to follow the central limit theorem, I fear). If it is a numbers game, I could simply increase the number of atoms, and then the groupings should more closely follow a normal distribution. I'll code it up and see what the required rtol would have to be, to see if it's reasonable.

I would generate a uniformly distributed fluid of di-atomic molecules X-X (at a fixed distance d) where the center of mass is sampled randomly from a uniform distribution and the orientation is uniformly sampled from the unit sphere. I would not worry about overlaps. With this system you then know that center-of-mass density is (approximately) constant. I am pretty sure that this is also true for the density of the individual X.
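A minimal sketch of generating such a diatomic fluid with NumPy (the bond length d and the counts below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N_mol, L, d = 500, 10.0, 1.0   # molecule count, box edge, bond length (illustrative)

# centres of mass sampled uniformly in the box
com = rng.uniform(0.0, L, size=(N_mol, 3))

# orientations sampled uniformly on the unit sphere (normalised Gaussian trick)
v = rng.normal(size=(N_mol, 3))
v /= np.linalg.norm(v, axis=1, keepdims=True)

# place the two X atoms at +/- d/2 along the molecular axis
atoms = np.empty((2 * N_mol, 3))
atoms[0::2] = com + 0.5 * d * v
atoms[1::2] = com - 0.5 * d * v

# bond lengths are exactly d before wrapping
assert np.allclose(np.linalg.norm(atoms[0::2] - atoms[1::2], axis=1), d)

# wrap atoms back into the periodic box
atoms %= L
```

By construction, the centre-of-mass density is uniform, so the constant-density checks apply directly, without the centre-of-mass distribution concern raised in the quoted comment.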

orbeckst avatar Apr 23 '25 21:04 orbeckst

@EchoFia do you think you're going to continue on this PR?

orbeckst avatar Jun 26 '25 20:06 orbeckst

Apologies, just realised I never responded here. When I opened #3603, the issue in my mind wasn't that the expected values were hardcoded, it was that a) the set of expected values were calculated with the code meant to be tested and b) the test system is not meaningful for charge densities of residues, segments, or fragments because all these groupings are neutral here.

As @orbeckst has said, for example in https://github.com/MDAnalysis/mdanalysis/pull/3572#discussion_r843373578, problem a) is of lower priority because these tests at least tell us if something breaks in the future. The more important aspect of #3603 is problem b).

Your solution, while rigorous, seems to me to be rather more complicated than required to address the issue at hand. I don't think it's necessary to randomly create a Universe every time this test is run. I believe a better solution would be to use your code to generate new test universes (that match the criteria mentioned by other reviewers of this PR) once and add these to MDAnalysisTests.datafiles. They could then be loaded for comparison to pre-calculated expected values. This would reduce the complexity of your PR and be more in line with the original behaviour of the tests, while still addressing problem b) (and potentially problem a), depending on how you source the expected values).
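A sketch of that "generate once, ship as a data file" workflow (the filename and the hand-rolled XYZ writer are illustrative; in practice an MDAnalysis writer could be used, and the resulting file would be committed to MDAnalysisTests once rather than regenerated per run):

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 100, 10.0
positions = rng.uniform(0.0, L, size=(N, 3))

# write a minimal XYZ file once; the resulting file could be committed to
# MDAnalysisTests.datafiles and loaded in the tests as static input
with open("uniform_fluid.xyz", "w") as f:
    f.write(f"{N}\n")
    f.write(f"uniform test fluid, cubic box L={L} A\n")
    for x, y, z in positions:
        f.write(f"X {x:.5f} {y:.5f} {z:.5f}\n")
```

The tests would then load this fixture and compare against pre-calculated expected values, keeping the randomness out of the test run itself.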

BFedder avatar Jun 27 '25 15:06 BFedder

Thank you for all your work here @EchoFia . Because there hasn't been activity in a while, I am closing. If you want to pick it up again, please say so and we can reopen.

orbeckst avatar Oct 13 '25 19:10 orbeckst