PairList accessor for NeighborLists
Description
Allows the `NeighborList` Python class to act as a standalone Compute, and adds an accessor method to retrieve a pair list. The implementation isn't completely finished, but it's working for CPU (and likely GPU) builds; MPI is probably broken. I started the pull request so I could get some feedback on the Python API, as well as on how the C++ implementation for `getPairList` should behave.
From the Python side I added one new parameter to all of the nlists, `default_r_cut`, which defaults to `None`. The behavior here differs from pair potentials in that if the parameter remains `None`, it sets the default `r_cut` to 0 for all type pairs. Users can now modify the nlist's `r_cut`s after initialization using the `TypeParameter`. The pair list at the current timestep is accessed with the `pair_list` property, which is loggable (untested). Does this API look suitable?
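For concreteness, here is a rough sketch of how I picture the API being used (assumes an existing `sim` Simulation; `default_r_cut`, `r_cut`, and `pair_list` are the names proposed above, not an existing API):

```python
import hoomd

# proposed default_r_cut parameter on the nlist constructor
nlist = hoomd.md.nlist.Cell(buffer=0.4, default_r_cut=2.5)

# per-type-pair cutoffs can be changed after construction via the r_cut TypeParameter
nlist.r_cut[('A', 'A')] = 3.0

# the nlist now works as a standalone compute
sim.operations.computes.append(nlist)
sim.run(0)

# pairs within r_cut + buffer at the current timestep
pairs = nlist.pair_list
```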
The C++ impl of `getPairListPython` updates the nlist, loops through all neighbors (unfiltered, including neighbors within `r_cut + buffer`), and returns a vector of unsigned int pairs. Here I'm curious what the behavior should be when multiple nlist consumers are registered to the `NeighborList`. Should it select neighbors based upon `rcut_max`, `rcut_base`, or just leave the filtering job up to the caller? Should there be some logic to handle multiple cases? Should we cache the pair list somewhere if the same case is encountered at the same timestep? Is an SOA return type preferred to AOS? At the moment the pair list respects the current `StorageMode` of the nlist. Should it always yield the "full" or "half" nlist instead?
I'd also like to implement a zero-copy accessor to the 3 primary neighbor list arrays (`NNeighArray`, `NListArray`, and `HeadList`). I'm going to try copying from the context manager implementation of `{device}_local_force_arrays`. Any advice there would be appreciated.
Motivation and context
Resolves #1259
How has this been tested?
I intend to implement a few additional pytests in the `md` module.
- [ ] Test behavior before and after attaching/detaching nlist from simulation
- [ ] Test `pair_list` correctness and API usage
  - [ ] Specified `default_r_cut`, no other nlist consumers
  - [ ] `default_r_cut` + additional nlist consumers with different r_cuts
  - [ ] Manually specified `r_cut` after nlist instantiation
- [ ] Test logging pair list to file (see the sketch after this list)
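For the logging test, I'm picturing something along these lines (an untested sketch; assumes `pair_list` ends up registered as a loggable quantity and that `nlist` and `sim` already exist):

```python
import hoomd

logger = hoomd.logging.Logger()
logger.add(nlist, quantities=['pair_list'])   # assumes pair_list is registered as loggable

gsd_writer = hoomd.write.GSD(filename='pairs.gsd',
                             trigger=hoomd.trigger.Periodic(100),
                             mode='wb',
                             filter=hoomd.filter.Null())
gsd_writer.log = logger
sim.operations.writers.append(gsd_writer)
sim.run(1000)
```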
I haven't modified docs yet, but once the Python API is finalized I'll do so.
- [ ] Update docs to reflect `pair_list` API
Change log
- [ ] Propose change log entry when API finalized
Checklist:
- [x] I have reviewed the Contributor Guidelines.
- [x] I agree with the terms of the HOOMD-blue Contributor Agreement.
- [x] My name is on the list of contributors (`sphinx-doc/credits.rst`) in the pull request source branch.
Here are my thoughts. All points are open to further discussion.
> The pair list at the current timestep is accessed with the `pair_list` property, which is loggable (untested). Does this API look suitable?
Yes, this is a good API. I question whether `pair_list` should be loggable. It comes down to whether `pair_list` is a rank-local pair list with particles by id or a global pair list with particles by tag. More on this below.
> The C++ impl of `getPairListPython` updates the nlist, loops through all neighbors (unfiltered, including neighbors within `r_cut + buffer`), and returns a vector of unsigned int pairs. Here I'm curious what the behavior should be when multiple nlist consumers are registered to the NeighborList. Should it select neighbors based upon `rcut_max`, `rcut_base`, or just leave the filtering job up to the caller?
The filtering should be left to the caller. Reason: to support consumer classes that use the Python API. Such a consumer needs to access the neighbors within the `r_cut` it requests, which is contained within the combined maximum `r_cut + buffer` that `NeighborList` captures.
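For example, a Python consumer could filter the buffered list down to its own cutoff with numpy (a sketch only; assumes single-rank data, an orthorhombic box, and that the accessor returns an `(n_pair, 2)` integer array; `my_r_cut` is the consumer's own cutoff):

```python
import numpy as np

snap = sim.state.get_snapshot()
box = np.array(snap.configuration.box[:3])
pos = snap.particles.position

pairs = np.asarray(nlist.pair_list)            # everything within max r_cut + buffer
rij = pos[pairs[:, 1]] - pos[pairs[:, 0]]
rij -= box * np.round(rij / box)               # minimum image convention
keep = np.linalg.norm(rij, axis=1) < my_r_cut  # keep only pairs inside this consumer's cutoff
filtered_pairs = pairs[keep]
```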
> Should there be some logic to handle multiple cases?
We could, but the complexity is only necessary if we have some strong use-cases for it. The most general API would be a method that accepts the r_cut matrix as a parameter and returns a pair list:

```python
pair_list = nlist.compute_pair_list(r_cut_matrix)
```
However, this API breaks many of the optimization patterns present in `NeighborList`, such as reusing the list when particles move less than 1/2 the buffer distance, and it requires an additional pass over the particle positions to filter.
> Should we cache the pair list somewhere if the same case is encountered at the same timestep?
We could. Some classes in HOOMD employ this pattern while others do not. The pair list generation is expensive enough that it is certainly worth implementing the cache.
> Is an SOA return type preferred to AOS?
The consensus in the #1259 discussion is to store `pair_list` in a dense array with dimensions `[2, n_pair]` so that memory operations in GPU threads over pairs perform contiguous memory accesses. CPU caches are large enough that the layout will not change performance noticeably.
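To illustrate the layout (a toy sketch with made-up values, not code from this PR):

```python
import numpy as np

# Dense [2, n_pair] layout: each row is one contiguous block of memory.
pair_list = np.array([[0, 0, 1, 2],    # row 0: the "i" particle of each pair
                      [1, 3, 2, 3]])   # row 1: the "j" particle of each pair

i, j = pair_list
neighbors_of_0 = j[i == 0]             # -> array([1, 3])
```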
> At the moment the pair list respects the current `StorageMode` of the nlist. Should it always yield the "full" or "half" nlist instead?
The `full` mode exists to enable HOOMD's no-communication pair potential parallelization scheme (the i <-> j and j <-> i forces are computed separately in parallel). If we expect users of `pair_list` to also use such a scheme, we could provide the option. Otherwise, I think most users (especially CPU users) would use the `half` mode: storing only pairs where `i < j`.
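If we only expose the `half` list, a user who needs both directions can build the `full` list cheaply on their side (a sketch; assumes the dense `[2, n_pair]` layout discussed above):

```python
import numpy as np

half = np.asarray(nlist.pair_list)                  # pairs with i < j
full = np.concatenate([half, half[::-1]], axis=1)   # append the mirrored (j, i) pairs
```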
> I'd also like to implement a zero-copy accessor to the 3 primary neighbor list arrays (NNeighArray, NListArray, and HeadList). I'm going to try copying from the context manager implementation of `{device}_local_force_arrays`. Any advice there would be appreciated.
My thoughts are that these are part of the internal implementation details of HOOMD and subject to change. With a short GPU kernel followed by a thrust stream compaction, we can generate a local `pair_list` (with a stable API) on the GPU for a minimal cost. The generic `pair_list` data structure can be used in numpy, cupy, and other applications and could be accessed directly with local arrays. This would make `pair_list` local to the rank.
Would that fit your use-case? Or do you need to remove the `pair_list` generation cost, and/or do you have a use-case that works best with HOOMD's convoluted internal storage format?
Circling back to the MPI question, the `pair_list` I propose would be local and store particles by index. This would not be suitable for logging. If we wanted to provide an interface that more closely matches the "global" and "local" snapshot concept, we could name this `pair_list_local` and also provide a global `pair_list` that collects the local list from all ranks, merges them, and converts particle indices to tags. This global list would be loggable.
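To sketch the distinction (neither property exists yet; the names follow the paragraph above, and `sim` is an attached Simulation):

```python
import numpy as np

local_pairs = nlist.pair_list_local   # rank-local particle *indices* (may reference ghosts)
global_pairs = nlist.pair_list        # gathered over all ranks, particle *tags*, loggable

# only the global, tag-based list makes sense to write out, and only from one rank
if sim.device.communicator.rank == 0:
    np.save('pairs.npy', np.asarray(global_pairs))
```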
Just some more thoughts to add. For this to be useful in Python, we would need the value returned to be a NumPy array. It would be great if we could do this without copying as well; https://stackoverflow.com/questions/54876346/pybind11-and-stdvector-how-to-free-data-using-capsules has examples of this. However, this would only work on the CPU. We could create a C++ class that represents a GPU buffer using `__cuda_array_interface__`, as we already do in our local access classes, to avoid a copy on the GPU as well.
> Just some more thoughts to add. For this to be useful in Python, we would need the value returned to be a NumPy array. It would be great if we could do this without copying as well; https://stackoverflow.com/questions/54876346/pybind11-and-stdvector-how-to-free-data-using-capsules has examples of this. However, this would only work on the CPU. We could create a C++ class that represents a GPU buffer using `__cuda_array_interface__`, as we already do in our local access classes, to avoid a copy on the GPU as well.
That all sounds good to me! I think it's just a question of how best to separate the various access patterns to meet the most common use cases. I think primarily there is the goal of handing the user both rank-local and global versions of a numpy pair list (yea, my impl with `std::pair` doesn't do that, but it's an easy fix following some examples) that's only intended to be accessed from the CPU. Like @joaander suggests, this could be `pair_list` and `pair_list_local` (or `global_pair_list` and `local_pair_list`), and it basically helps users prototype force computes (or even nlist-based analyzers) in Python with little fuss.
Then I think there is the business of providing more low-level access to the internal arrays for users that want to use `cython`, `numba`, `numba.cuda`, or even `pyo3` to write more performant implementations. I think there it's best to just hand over an `__array_interface__` or `__cuda_array_interface__` wrapper around the internal representations using a context manager. Maybe call these `cpu_local_nlist_arrays` and `gpu_local_nlist_arrays`?
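For example, usage could mirror `cpu_local_snapshot` (purely hypothetical; the attribute names `head_list`, `n_neigh`, and `nlist` below are placeholders for the three internal arrays):

```python
# hypothetical context manager exposing read-only views of the internal buffers
with nlist.cpu_local_nlist_arrays as arrays:
    head_list = arrays.head_list   # offset of each particle's neighbor segment
    n_neigh = arrays.n_neigh       # number of neighbors of each particle
    flat_nlist = arrays.nlist      # flat array of neighbor indices

    # reconstruct the neighbors of particle i from the ragged layout
    i = 0
    start = head_list[i]
    neighbors_of_i = flat_nlist[start:start + n_neigh[i]]
```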
Also on the GPU, do you think it is a good idea to even offer a pair list API? I get the impression that the 3-array internal representation was designed so that it's easy to reduce the forces on a single particle, so as to avoid atomics? If so, it's probably best to encourage the native pattern on GPUs?
I would like to avoid the context manager approach unless we are committing to creating a new member variable for the neighbor list that stores the pairs. Even then, I would want to avoid this. I would rather users be responsible for caching in Python. The context manager approach is most useful when you want to be able to mutate or read an existing buffer in Python. It has its flaws though. We cannot prevent users from SEGFAULTing on the GPU, for instance.
Oh, I don't think there should be a context manager for the pair list. I'm suggesting that it might be a good idea to also expose these arrays using a context manager for very low-level access, basically copying the `cpu_local_snapshot` and `cpu_local_force_arrays` APIs.
https://github.com/glotzerlab/hoomd-blue/blob/6a508701a09f0471723c8ed2c8392f52a0eecf35/hoomd/md/NeighborList.h#L516-L517
https://github.com/glotzerlab/hoomd-blue/blob/6a508701a09f0471723c8ed2c8392f52a0eecf35/hoomd/md/NeighborList.h#L522
This would be separate from the pair list.
Just looking to get some feedback on the low-level idea, but I can separate it into another pull request if folks are interested in that feature.
> Like @joaander suggests, this could be `pair_list` and `pair_list_local` (or `global_pair_list` and `local_pair_list`), and it basically helps users prototype force computes (or even nlist-based analyzers) in Python with little fuss.
To be consistent with how we name this in other places, it should be:

- `pair_list` - global pair list, like `State.get_snapshot` and `md.Force.forces`.
- `local_pair_list` - rank-local pair list, no direct analog but like `State.cpu_local_snapshot`.
- `cpu_local_*` and `gpu_local_*` for direct access, like `State.?pu_local_snapshot` (though @b-butler recommends not using the context manager approach?).

Unless we have better names for these.
> Also on the GPU, do you think it is a good idea to even offer a pair list API? I get the impression that the 3-array internal representation was designed so that it's easy to reduce the forces on a single particle, so as to avoid atomics? If so, it's probably best to encourage the native pattern on GPUs?
I don't know whether users would find a pair list API useful on GPUs. HOOMD doesn't use one internally; as you say, the 3-array format parallelizes the computation and maximizes bandwidth when executing multiple threads per particle. I've had very little feedback from users on this topic. I know one application (HOOMD-TF) implements a C++ plugin to convert the neighborlist to a different data structure suitable for use in TensorFlow. That's why I say: if you think this is valuable and you will use it, feel free to expose it (read only) to users. The only downsides are 1) we need to document the ragged sparse array format and 2) any internal changes necessitate a major version bump. Neither of these is a major problem.
Yea no, I don't think a GPU pair list is a good idea, just confirming with the rest of y'all. I think the useful GPU feature is to just expose the internal arrays, readonly.
@joaander and @ianrgraham To clarify, I am not advocating against the local access model where it applies, just against using it for computed quantities like `pair_list` and `local_pair_list`, where the easier and safer memory model is to just return the `std::vector` or to use `pybind11::capsule`. For exposing the neighborlist buffers, the local access model will need to be used.
I also agree that if the 3 buffers are useful to you in Python, then exposing them using the local access API is fine, and we can major-version bump if we need to change the buffers.
> @joaander and @ianrgraham To clarify, I am not advocating against the local access model where it applies, just against using it for computed quantities like `pair_list` and `local_pair_list`, where the easier and safer memory model is to just return the `std::vector` or to use `pybind11::capsule`.
Are these methods an alternative to the one used here:
https://github.com/glotzerlab/hoomd-blue/blob/6a508701a09f0471723c8ed2c8392f52a0eecf35/hoomd/ParticleData.cc#L3822-L3832 ?
> Are these methods an alternative to the one used here?
I'm pretty sure it boils down to who is intended to have ownership of the data at the end of the function call. In the `capsule` example, the caller will gain ownership, and once the array is deleted on their side the `capsule` callback will delete the data. In the `pybind11::array` example without a `capsule`, I'm pretty sure the callee (the `SnapshotParticleData` object) still retains ownership, and the numpy array is basically just a reference.
So yea, @b-butler is right: for the pair list we would want to use a `capsule` if we're not caching anything on the C++ side. But it just occurred to me that if one's using a decent buffer, the pair list could be unchanged over many timesteps. It might be worth using shared pointers so that the C++ side can hold onto a reference that it keeps returning (and cheaply hands back to Python) when nothing has changed; whenever the nlist is updated we just allocate a new pair list, and the old one gets cleaned up by the shared pointers.
@joaander The method you show, as @ianrgraham says, implies that data ownership lies with the snapshot object, and the owning object (i.e. the snapshot) promises to keep the pointer alive until the NumPy array dies. This actually may lead to possible segfaults, looking at it now. A pathological case would be:
```python
snap = sim.state.get_snapshot()
pos = snap.particles.position
del snap
# Access deallocated memory
pos[0]
```
The `py::capsule` approach gives ownership to NumPy/Python.
Also @ianrgraham, I think passing a `shared_ptr<Scalar[]>` to `py::capsule` would accomplish what you are talking about.
The `, self)` argument ties the lifetime of the numpy array to the snapshot object, so the snapshot object will not be deleted until all references to the internal arrays are also deleted. This allows us to prevent costly memory allocations and multiple copies of data.
I think the pathological case comes in when C++ resizes the arrays, deallocating the old pointers in the process. Any existing numpy arrays holding those pointers would then be invalid.
```python
snap = sim.state.get_snapshot()
pos = snap.particles.position
snap.particles.N = 10  # resizing reallocates the underlying arrays
# Access deallocated memory
pos[0]
```
Would the capsule approach avoid this case while still allowing for zero-copy access in non-pathological cases?
Ah, @joaander, I missed the handle for ownership, and yes, using `py::capsule` with `std::shared_ptr<type[]>` would solve this. This is what freud does.
@joaander ahh, I didn't realize that it was using this impl of `array`. That's pretty cool, though I don't totally understand what pybind is doing with the base handle to tie the lifetimes.
And @b-butler, I'll use the `capsule` mechanism to pass the pair list to Python then. Thanks for the suggestion!
I should be able to turn this around fairly soon.
Going to run the suite of pytests (hopefully once 🤞) before I write the new pytests, then it should be ready for review.
> Going to run the suite of pytests (hopefully once 🤞) before I write the new pytests, then it should be ready for review.
It appears you need a slightly larger system (increase the `n` argument to `lattice_snapshot_factory`) so that MPI tests will pass: https://github.com/glotzerlab/hoomd-blue/actions/runs/3145718687/jobs/5113983775#step:7:5079
Yea, I noticed that issue, but there is also something else breaking when there is domain decomposition with the global pair list (the rank-local stuff is working). When I run the pytests with 2 ranks, I keep getting that the second MPI rank is aborting, but I can't figure out how to debug it. Do you have any tips? The naive thing of writing print statements and passing the `-s` argument to pytest to print standard output seems to get ignored.
Also, the global pair list is returning more neighbors with MPI than without. My guess is that I need to remove additional copies of pairs across domains? Not totally sure. I'll commit what I currently have if someone doesn't mind taking a peek at the global pair list code and tests.
> Yea, I noticed that issue, but there is also something else breaking when there is domain decomposition with the global pair list (the rank-local stuff is working). When I run the pytests with 2 ranks, I keep getting that the second MPI rank is aborting, but I can't figure out how to debug it. Do you have any tips? The naive thing of writing print statements and passing the `-s` argument to pytest to print standard output seems to get ignored.
You can run pytest with the `--capture=no` argument to show all stdout. Run just one test file at a time with this to make the output easier to parse:

```
pytest -v -x -ra --capture=no hoomd/md/pytest/some_test.py
```
> Also, the global pair list is returning more neighbors with MPI than without. My guess is that I need to remove additional copies of pairs across domains? Not totally sure. I'll commit what I currently have if someone doesn't mind taking a peek at the global pair list code and tests.
I'm busy this week but can hopefully take a look next week. It can be tricky to get MPI communication working properly. I haven't looked at your code yet, but my initial thought would be to filter the ghost particles out of the local pair list, then convert from ids to tags, then perform an MPI gather operation.
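Schematically, that recipe looks something like this in Python with mpi4py (a sketch only; the real implementation would live in C++, and `local_pairs` and `tags` are assumed inputs):

```python
from mpi4py import MPI
import numpy as np

# Assumed inputs:
#   local_pairs: (n_pair, 2) rank-local particle indices (the j particle may be a ghost)
#   tags:        tag of each rank-local index, ghost particles included

tag_pairs = np.sort(tags[local_pairs], axis=1)   # indices -> tags, order each pair as (min, max)

comm = MPI.COMM_WORLD
gathered = comm.gather(tag_pairs, root=0)        # collect every rank's contribution on the root
if comm.Get_rank() == 0:
    # pairs that straddle a domain boundary appear on both ranks; de-duplicate them
    global_pairs = np.unique(np.concatenate(gathered), axis=0)
```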
Solved the MPI issue; it was just a pybind return type issue when not on rank 0. But now I'm scratching my head a bit over ghost particles. Is it normal for the data at ghost particle indices to be junk at times? I'm finding that with domain decomposition, the position information for the ghost particles is just wrong, such that |pos_i - pos_j| < r_cut and these pairs are incorrectly being added to the list.
> Is it normal for the data at ghost particle indices to be junk at times? I'm finding that with domain decomposition, the position information for the ghost particles is just wrong, such that |pos_i - pos_j| < r_cut and these pairs are incorrectly being added to the list.
No, it is not normal. The call to `m_comm->communicate(timestep);` should communicate the ghost particle positions between ranks. If this is not called, then there will be no ghost particles. There may be other corner cases.
I think I figured it out: the neighbor list itself wasn't setting the `comm_flag::position`. It looks like it was assumed that if you have a neighbor list there must be something else in the simulation that is requesting positions, which was true before these changes.
There are still a few ghost neighbors that are passing through with incorrect positions, but setting the flag fixed a lot of them.
> I think I figured it out: the neighbor list itself wasn't setting the `comm_flag::position`. It looks like it was assumed that if you have a neighbor list there must be something else in the simulation that is requesting positions, which was true before these changes.
Good find! It appears that this flag is set by `ForceCompute` in the current codebase. Yes, it needs to be set by `NeighborList` as well.
> There are still a few ghost neighbors that are passing through with incorrect positions, but setting the flag fixed a lot of them.
Do you know why? Can you post a minimal script that reproduces this? This shouldn't be occurring, and I'd be happy to look into it further.
The current test failure appears to be caused by accessing `pair_list` on non-root ranks: https://github.com/glotzerlab/hoomd-blue/actions/runs/3154493032/jobs/5132712878#step:7:4944
Just wrap that in an `if device.communicator.rank == 0:` check.
> Do you know why? Can you post a minimal script that reproduces this? This shouldn't be occurring, and I'd be happy to look into it further.
So it was a bit of a false alarm. I wasn't aware that the communicator would sometimes wrap positions of ghost particles if they were near the boundary of the global box. So sometimes I was seeing that, with a cubic box with L=7, a particle at [0,0,3] in its own rank (and the global snapshot) will be viewed as a ghost particle at [0,0,-4] in the neighboring domain across the periodic boundary.
The big problem that I'm having is something I noticed even outside of MPI, and I can't explain why it's happening. When I create a neighborlist and assign it r_cuts, the nlist works as expected. If I create a neighborlist and don't give it an r_cut, I get zero neighbors as expected. If I take this r_cut=0 nlist and add an nlist consumer like a pair potential with an r_cut that should give me the same neighbors as before, I surprisingly get no neighbors. I was trying to debug this, and I was amazed to see that the r_cut matrix is being added and then quickly removed from the consumer list. Just now I figured out that this only occurs when I register the neighborlist as a standalone compute in addition to tying it to the pair potential, something like this:
```python
nlist = hoomd.md.nlist.Cell(buffer=0.0, default_r_cut=0.0)
sim: hoomd.Simulation = simulation_factory(lattice_snapshot_factory())
integrator = hoomd.md.Integrator(0.01)
lj = hoomd.md.pair.LJ(nlist, default_r_cut=1.1)  # cubic lattice has neighbors at a spacing of a=1.0
lj.params[('A', 'A')] = dict(epsilon=1.0, sigma=1.0)
integrator.forces.append(lj)
integrator.methods.append(hoomd.md.methods.NVE(hoomd.filter.All()))
sim.operations.integrator = integrator
sim.operations.computes.append(nlist)  # if I comment this out, I will get the correct pair list
# ^ here also seems to be where the lj pair r_cut_matrix is randomly removed from the nlist
sim.run(0)
local_pair_list = nlist.local_pair_list  # pair list is empty
```
I think I can figure this out with a bit more time, but it's clearly something odd about how the nlist object gets attached to the simulation.
Edit: Or should this just be an error condition?
> Or should this just be an error condition?
No, I would think that this use-case should be supported. @b-butler, are you aware of any issues with objects both appearing in `Operations.computes` and also being dependents of other operations? It seems from this description that the nlist is being attached twice, or something strange is happening. This has implications for your work on #1413 as well.
We used to have a dependency on the integrator with the `Clusters` object to enable integrator switching. It should be possible, but yes, currently it calls attach twice here:
https://github.com/glotzerlab/hoomd-blue/blob/c1b95d29a7e41a16542bdf36e8ee8cdb1f687e5d/hoomd/data/syncedlist.py#L191-L202
We could change this to check if an object is attached first. This should be safe assuming the adding code works as intended.
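Roughly, the check I have in mind looks like this (a sketch of the shape of the fix, not the actual diff; `new_value` is an illustrative variable name and `_attached`/`_attach` refer to the internal operation API):

```python
# Sketch: skip the second attach when the object is already attached
# (e.g. an nlist that was already attached as a pair potential's dependency).
if not new_value._attached:
    new_value._attach()
```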
Edit: I added the necessary line to #1413.
Cool! I'll test to make sure that change fixes it.