mdanalysis
mdanalysis copied to clipboard
MDAnalysis.lib.distances needs rework
The documentation of the module MDAnalysis.lib.distances
has several issues, whereof one has already been addressed by @xiki-tempula in issue #2004. I also see some points where code duplication could be reduced to make the code more DRY.
Documentation issues:
- [ ] Probably deprecated title: Fast distance array computation IMHO that seems a bit misleading. Even though the module contains methods to compute distance arrays, it also includes methods for coordinate transformation, PBC handling, and for the calculation of angles and dihedrals.
- [x] The same issue applies to the doc header, which doesn't describe the module's contents sufficiently since it only refers to distance arrays. (Fixed via PR #2070)
- [x] The methods listed as
.. autofunction: ...
in the header appear as duplicates in the docs. Furthermore, the way theautofunction
strings are formatted may not be very useful (nested square brackets without default values for kwargs). (Fixed via PR #2070) - [x] Many docstrings violate the 80 character line length limit. (Fixed via PR #2070)
- [x] The methods
augment_coordinates
andundo_augment
, which are imported fromMDAnalysis.lib._augment
, are missing from the docs. (Fixed via PR #2062) - [x] Many docstrings have issues with spelling mistakes, missing or improperly referenced parameters and methods, wrong or missing data types, and potentially misleading explanations. That includes but is not limited to issue #2004. (Fixed via PR #2070)
- [x] In general, the docstrings could be much more standardized, there's no need to have different descriptions for parameters such as box which always have the same purpose or meaning. (Fixed via PR #2070)
Code issues:
- [x] Many functions use the
_box_check()
helper function to determine the type of simulation box supplied. Thereafter, the box coordinates are transformed to the memory layout expected by the subsequently called low-level C functions. This transformation should be incorporated into_check_box()
to avoid code duplication. (Fixed via PR #2048) - [x] The docstrings of most functions state that the box must be supplied in the format
[lx, ly, lz, alpha, beta, gamma]
as returned by `Timestep.dimensions```. The_box_check
function doesn't reflect that. So ~either the requirements for the box should be less strict or~_box_check
should be stricter in that respect. (Fixed via PR #2048) - [x] Not only checking but also creating the results array (if required) should take place in the
_check_result_array()
helper function. (Fixed via PR #2048) - [x] Many functions now incorporate automatic dtype conversion, so the dtype check in the
_check_array()
helper function is now redundant and can be removed in these cases. (Fixed via PR #2048 by means of a new@check_coords()
decorator) - [x] PR #2048 introduced a subtle bug so that in certain situations, some functions change their input coordinate arrays in-place. (Fixed via PR #2083)
- [x] Many functions choke on empty input coordinates arrays (i.e., with
shape=(0, 3)
). (Fixed via PR #2083) - [x] Depending on the employed search method, the results of
capped_distance()
andself_capped_distance()
are not always numpy arrays. (Fixed via PR #2083) - [x] If no pairs are found,
_bruteforce_capped_self()
correctly returns empty pairs but unfortunately also non-empty mumbo-jumbo distances. (Fixed via PR #2083) - [x]
_bruteforce_capped()
crashes if all input coordinates are the same andbox
isNone
. (Fixed via PR #2083) - [x]
lib.distances._check_box()
should be moved tolib.util
(with underscore removed). (Fixed via PR #2114) - [x] Not all methods for
*capped_distance()
have the same cut-off criteria (sometimesdistances < max_cutoff
and sometimesdistances <= max_cutoff
). (Fixed via PR #2114) - [x] The different methods for
*capped_distance()
do not always return the same number of pairs. (EDIT: only pathological cases, won't fix) - [x]
lib.nsgrid
calculates distances in single precision, whereas all functions in lib.distances use double precision. (Fixed via PR #2114) - [x]
lib.nsgrid.PBCBox
uses an arbitrary constantEPSILON=1e-5
to determine whether a box is triclinic. This a) does not correspond to the behavior of other functions, and b) fails if a box angle is, e.g., 90.00001 degrees (or higher or negative). (Fixed via PR #2114) - [x]
lib/include/calc_distances.h
contains a lot of duplicated code (functions differing in PBC type only). (EDIT: unifying functions for different PBC types impacts performance, won't fix) - [ ] OpenMP parallelization in
lib/include/calc_distances.h
often suffers from false sharing. - [ ] In
lib.distances.distance_array()
, the inner loop (the one over theconfiguration
coordinate array) should be parallelized instead of the outer one (going over thereference
coordinate array), since often, one seeks to know the number of "neighbors" with respect to some reference atom. - [ ] Tidy the namespace, make C functions invisible, currently we have everything visible
TODO suggestions:
- [x] Documentation:
- [x] Ensure that parameters referenced in docstrings correspond to function signatures. (Fixed via PRs #2048 and #2070)
- [x] Properly format references to other methods so that links are generated. Add ~
.. seealso :
~See Also
sections where applicable. (Fixed via PR #2070) - [x] Use italics only to reference function parameters, use teletype font for variables and values in docstrings. (Fixed via PR #2070)
- [x] Use logical instead of visual markup (e.g., backticks instead of asterisks for function parameters). (Fixed via PR #2070)
- [x] Enforce 80 character lines (Fixed via PR #2070)
- [x] Standardize phrasing of repeatedly occurring parameter descriptions (Fixed via PR #2070)
- [x] Always state correct type of parameters (e.g.
numpy.ndarray
if a numpy array is expected) (Fixed via PRs #2048, #2062, and #2070) - [x] Mark optional parameters by adding , optional after the type (e.g.
box : array_like, optional
) (Fixed via PR #2070) - [x] Do not give default values for kwargs in brackets since these appear in the auto-generated function signature. (Fixed via PR #2070)
- [ ] Module Header:
- [ ] Change title to something more descriptive (e.g., "fast geometry computations" or similar)
- [x] Improve module description to better correspond to its contents (Fixed via PR #2070)
- [x] Remove ~
.. autofunction
lines~:members:
fromdistances.rst
causing duplicates
- Code:
- [x] Move box memory layout transformations to
_check_box
and ~either loosen requirements for its expected format or~ make checks stricter. (Fixed via PR #2048) - [x] Fix bugs introduced in PR #2048 and add tests assuring input arrays always remain unmodified. (Fixed via PR #2083)
- [x] For completeness, make sure that all functions also work with empty coordinate arrays (
shape=(0, 3)
) and add corresponding test cases. (Fixed via PR #2083) - [x] Ensure
capped_distance()
andself_capped_distance()
always return numpy arrays and add corresponding tests. (Fixed via PR #2083) - [x] Ensure
_bruteforce_capped_self()
returns empty distances if no pairs are found. Add corresponding tests for all*capped_distance()
functions. (Fixed via PR #2083) - [x] Fix
_bruteforce_capped()
for the case when all input coordinates are the same andbox
isNone
. (Fixed via PR #2083) - [x] Move
lib.distances._check_box()
tolib.util
(with underscore removed). (Fixed via PR #2114) - [x] Ensure that all methods used in
capped_distance()
have the same cut-off criterion (distances <= max_cutoff
). (Fixed via PR #2114) - [x] Use double precision for distance computation in
lib.nsgrid
. (Fixed via PR #2114) - [x] Fix box type detection in
lib.nsgrid
and some functions inlib.distances
. (Fixed via PR #2114) - [ ] Optimize parallelism in
lib/include/calc_distances.h
for 64 byte cachelines (avoid false sharing). - [ ] Parallelize inner instead of outer loop in
lib.distances.distance_array()
.
- [x] Move box memory layout transformations to
That's quite a lot of things to do, but I've already started working on most of the points. There are still issues to be discussed, especially the module's title ~and description and how to proceed with the requirements for boxes~.
Current version of MDAnalysis:
~0.18.1-dev~ 0.19.1-dev
@zemanj so with the openmp loops, I thought it was preferable to have as few fork/joins as possible, so ideally one for this double loop. Many it would be a good idea to ensure that the numref
loop was larger than numconf
loop. Would be interesting to see if this changes performance (ie 50 vs 1000 and 1000 vs 50)
WRT false sharing, we can just set the schedule to be the width of however many floats we can fetch at once? I would hope that the default scheduler would be doing that.
I added tidying the namespace, we currently dump everything there, which isn't useful to users (doing tab completion)
@richardjgowers Yes, we need something like lib.pbctools
or so. We should also deprecate the direct access to lib functions in the analysis
module.
With "make C functions invisible" you mean adding a couple of del
s at the end of c_distances.pyx
and c_distances_openmp.pyx
?
Regarding parallel performance, I got that already under control (dynamically reacting on array lengths etc.), just haven't had the time to wrap it up yet. Contrary to my first intuition, false sharing is either not a big deal or it is rather hard to tackle, for example when the threads in self_diatance_array()
dump their results in the output array (IIRC). Keep in mind that false sharing is only relevant in write operations, and it might sometimes compete with memory locality when reading data. I also played around with OpenMP schedules to no avail. The default schedule seems to be pretty well suited.
Hey, Just trying to take a hit at the lib.distances.distance_array() issue. This seems to be the relevant codebase where the issue is referring:
https://imgur.com/7sgIP8B
Could anybody guide me to which loop the issue is colluding or if the issue is solved and should be closed in the latest version?
Hi @tanmaymunjalz the loop in question is in the C header package/MDAnalysis/lib/include/calc_distances.h and is at line 383 406 and 431. I would make links but I'm on my phone.
Thanks! Will take a look