mdanalysis icon indicating copy to clipboard operation
mdanalysis copied to clipboard

MDAnalysis.lib.distances needs rework

Open zemanj opened this issue 5 years ago • 6 comments

The documentation of the module MDAnalysis.lib.distances has several issues, whereof one has already been addressed by @xiki-tempula in issue #2004. I also see some points where code duplication could be reduced to make the code more DRY.

Documentation issues:

  • [ ] Probably deprecated title: Fast distance array computation IMHO that seems a bit misleading. Even though the module contains methods to compute distance arrays, it also includes methods for coordinate transformation, PBC handling, and for the calculation of angles and dihedrals.
  • [x] The same issue applies to the doc header, which doesn't describe the module's contents sufficiently since it only refers to distance arrays. (Fixed via PR #2070)
  • [x] The methods listed as .. autofunction: ... in the header appear as duplicates in the docs. Furthermore, the way the autofunction strings are formatted may not be very useful (nested square brackets without default values for kwargs). (Fixed via PR #2070)
  • [x] Many docstrings violate the 80 character line length limit. (Fixed via PR #2070)
  • [x] The methods augment_coordinates and undo_augment, which are imported from MDAnalysis.lib._augment, are missing from the docs. (Fixed via PR #2062)
  • [x] Many docstrings have issues with spelling mistakes, missing or improperly referenced parameters and methods, wrong or missing data types, and potentially misleading explanations. That includes but is not limited to issue #2004. (Fixed via PR #2070)
  • [x] In general, the docstrings could be much more standardized, there's no need to have different descriptions for parameters such as box which always have the same purpose or meaning. (Fixed via PR #2070)

Code issues:

  • [x] Many functions use the _box_check() helper function to determine the type of simulation box supplied. Thereafter, the box coordinates are transformed to the memory layout expected by the subsequently called low-level C functions. This transformation should be incorporated into _check_box() to avoid code duplication. (Fixed via PR #2048)
  • [x] The docstrings of most functions state that the box must be supplied in the format [lx, ly, lz, alpha, beta, gamma] as returned by `Timestep.dimensions```. The _box_check function doesn't reflect that. So ~either the requirements for the box should be less strict or~ _box_check should be stricter in that respect. (Fixed via PR #2048)
  • [x] Not only checking but also creating the results array (if required) should take place in the _check_result_array() helper function. (Fixed via PR #2048)
  • [x] Many functions now incorporate automatic dtype conversion, so the dtype check in the _check_array() helper function is now redundant and can be removed in these cases. (Fixed via PR #2048 by means of a new @check_coords() decorator)
  • [x] PR #2048 introduced a subtle bug so that in certain situations, some functions change their input coordinate arrays in-place. (Fixed via PR #2083)
  • [x] Many functions choke on empty input coordinates arrays (i.e., with shape=(0, 3)). (Fixed via PR #2083)
  • [x] Depending on the employed search method, the results of capped_distance() and self_capped_distance() are not always numpy arrays. (Fixed via PR #2083)
  • [x] If no pairs are found, _bruteforce_capped_self() correctly returns empty pairs but unfortunately also non-empty mumbo-jumbo distances. (Fixed via PR #2083)
  • [x] _bruteforce_capped() crashes if all input coordinates are the same and box is None. (Fixed via PR #2083)
  • [x] lib.distances._check_box() should be moved to lib.util (with underscore removed). (Fixed via PR #2114)
  • [x] Not all methods for *capped_distance() have the same cut-off criteria (sometimes distances < max_cutoff and sometimes distances <= max_cutoff). (Fixed via PR #2114)
  • [x] The different methods for *capped_distance() do not always return the same number of pairs. (EDIT: only pathological cases, won't fix)
  • [x] lib.nsgrid calculates distances in single precision, whereas all functions in lib.distances use double precision. (Fixed via PR #2114)
  • [x] lib.nsgrid.PBCBox uses an arbitrary constant EPSILON=1e-5 to determine whether a box is triclinic. This a) does not correspond to the behavior of other functions, and b) fails if a box angle is, e.g., 90.00001 degrees (or higher or negative). (Fixed via PR #2114)
  • [x] lib/include/calc_distances.h contains a lot of duplicated code (functions differing in PBC type only). (EDIT: unifying functions for different PBC types impacts performance, won't fix)
  • [ ] OpenMP parallelization in lib/include/calc_distances.h often suffers from false sharing.
  • [ ] In lib.distances.distance_array(), the inner loop (the one over the configuration coordinate array) should be parallelized instead of the outer one (going over the reference coordinate array), since often, one seeks to know the number of "neighbors" with respect to some reference atom.
  • [ ] Tidy the namespace, make C functions invisible, currently we have everything visible

TODO suggestions:

  • [x] Documentation:
    • [x] Ensure that parameters referenced in docstrings correspond to function signatures. (Fixed via PRs #2048 and #2070)
    • [x] Properly format references to other methods so that links are generated. Add ~.. seealso :~ See Also sections where applicable. (Fixed via PR #2070)
    • [x] Use italics only to reference function parameters, use teletype font for variables and values in docstrings. (Fixed via PR #2070)
    • [x] Use logical instead of visual markup (e.g., backticks instead of asterisks for function parameters). (Fixed via PR #2070)
    • [x] Enforce 80 character lines (Fixed via PR #2070)
    • [x] Standardize phrasing of repeatedly occurring parameter descriptions (Fixed via PR #2070)
    • [x] Always state correct type of parameters (e.g. numpy.ndarray if a numpy array is expected) (Fixed via PRs #2048, #2062, and #2070)
    • [x] Mark optional parameters by adding , optional after the type (e.g. box : array_like, optional) (Fixed via PR #2070)
    • [x] Do not give default values for kwargs in brackets since these appear in the auto-generated function signature. (Fixed via PR #2070)
    • [ ] Module Header:
      • [ ] Change title to something more descriptive (e.g., "fast geometry computations" or similar)
      • [x] Improve module description to better correspond to its contents (Fixed via PR #2070)
      • [x] Remove ~.. autofunction lines~ :members: from distances.rst causing duplicates
  • Code:
    • [x] Move box memory layout transformations to _check_box and ~either loosen requirements for its expected format or~ make checks stricter. (Fixed via PR #2048)
    • [x] Fix bugs introduced in PR #2048 and add tests assuring input arrays always remain unmodified. (Fixed via PR #2083)
    • [x] For completeness, make sure that all functions also work with empty coordinate arrays (shape=(0, 3)) and add corresponding test cases. (Fixed via PR #2083)
    • [x] Ensure capped_distance() and self_capped_distance() always return numpy arrays and add corresponding tests. (Fixed via PR #2083)
    • [x] Ensure _bruteforce_capped_self() returns empty distances if no pairs are found. Add corresponding tests for all *capped_distance() functions. (Fixed via PR #2083)
    • [x] Fix _bruteforce_capped() for the case when all input coordinates are the same and box is None. (Fixed via PR #2083)
    • [x] Move lib.distances._check_box() to lib.util (with underscore removed). (Fixed via PR #2114)
    • [x] Ensure that all methods used in capped_distance() have the same cut-off criterion (distances <= max_cutoff). (Fixed via PR #2114)
    • [x] Use double precision for distance computation in lib.nsgrid. (Fixed via PR #2114)
    • [x] Fix box type detection in lib.nsgrid and some functions in lib.distances. (Fixed via PR #2114)
    • [ ] Optimize parallelism in lib/include/calc_distances.h for 64 byte cachelines (avoid false sharing).
    • [ ] Parallelize inner instead of outer loop in lib.distances.distance_array().

That's quite a lot of things to do, but I've already started working on most of the points. There are still issues to be discussed, especially the module's title ~and description and how to proceed with the requirements for boxes~.

Current version of MDAnalysis:

~0.18.1-dev~ 0.19.1-dev

zemanj avatar Aug 13 '18 15:08 zemanj

@zemanj so with the openmp loops, I thought it was preferable to have as few fork/joins as possible, so ideally one for this double loop. Many it would be a good idea to ensure that the numref loop was larger than numconf loop. Would be interesting to see if this changes performance (ie 50 vs 1000 and 1000 vs 50)

WRT false sharing, we can just set the schedule to be the width of however many floats we can fetch at once? I would hope that the default scheduler would be doing that.

richardjgowers avatar Mar 06 '19 18:03 richardjgowers

I added tidying the namespace, we currently dump everything there, which isn't useful to users (doing tab completion)

richardjgowers avatar Mar 06 '19 18:03 richardjgowers

@richardjgowers Yes, we need something like lib.pbctools or so. We should also deprecate the direct access to lib functions in the analysis module.

With "make C functions invisible" you mean adding a couple of dels at the end of c_distances.pyx and c_distances_openmp.pyx?

Regarding parallel performance, I got that already under control (dynamically reacting on array lengths etc.), just haven't had the time to wrap it up yet. Contrary to my first intuition, false sharing is either not a big deal or it is rather hard to tackle, for example when the threads in self_diatance_array() dump their results in the output array (IIRC). Keep in mind that false sharing is only relevant in write operations, and it might sometimes compete with memory locality when reading data. I also played around with OpenMP schedules to no avail. The default schedule seems to be pretty well suited.

zemanj avatar Mar 07 '19 03:03 zemanj

Hey, Just trying to take a hit at the lib.distances.distance_array() issue. This seems to be the relevant codebase where the issue is referring:

https://imgur.com/7sgIP8B

Could anybody guide me to which loop the issue is colluding or if the issue is solved and should be closed in the latest version?

tanmaymunjal avatar Aug 27 '22 04:08 tanmaymunjal

Hi @tanmaymunjalz the loop in question is in the C header package/MDAnalysis/lib/include/calc_distances.h and is at line 383 406 and 431. I would make links but I'm on my phone.

hmacdope avatar Aug 27 '22 07:08 hmacdope

Thanks! Will take a look

tanmaymunjal avatar Aug 27 '22 15:08 tanmaymunjal