
PairList accessor for NeighborLists

ianrgraham opened this pull request

Description

Allows the NeighborList Python class to act as a standalone Compute, and adds an accessor method to retrieve a pair list. The implementation isn't completely finished: it's working for CPU (and likely GPU) builds, but MPI is probably broken. I started the pull request so I could get some feedback on the Python API, as well as on how the C++ implementation of getPairList should behave.

From the Python side I added one new parameter to all of the nlists, default_r_cut, which defaults to None. The behavior here differs from pair potentials in that if the parameter remains None, it sets the default r_cut to 0 for all type pairs. Users can now modify the nlist's r_cuts after initialization using the TypeParameter. The pair list at the current timestep is accessed with the pair_list property, which is loggable (untested). Does this API look suitable?
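
Roughly, the intended usage looks like this (a rough sketch; sim stands in for an existing hoomd.Simulation with an integrator, and the exact names are still open to change):

import hoomd

# Proposed: nlists gain a default_r_cut parameter (None -> r_cut of 0 for all type pairs).
nlist = hoomd.md.nlist.Cell(buffer=0.4, default_r_cut=2.5)

# r_cut is a TypeParameter, so individual type pairs can be adjusted after construction.
nlist.r_cut[('A', 'A')] = 3.0

# The nlist can act as a standalone Compute without any pair potential attached to it.
sim.operations.computes.append(nlist)
sim.run(0)

pairs = nlist.pair_list  # unfiltered pairs within r_cut + buffer at the current timestep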

The C++ impl of getPairListPython updates the nlist, loops through all neighbors (unfiltered, including neighbors within r_cut + buffer), and returns a vector of unsigned int pairs. Here I'm curious what the behavior should be when multiple nlist consumers are registered to the NeighborList. Should it select neighbors based upon rcut_max, rcut_base, or just leave the filtering job up to the caller? Should there be some logic to handle multiple cases? Should we cache the pair list somewhere if the same case is encountered at the same timestep? Is an SOA return type preferred to AOS? At the moment the pair list is respecting the current StorageMode of the nlist. Should it always yield the "full" or "half" nlist instead?

I'd also like to implement a zero-copy accessor to the 3 primary neighbor list arrays (NNeighArray, NListArray, and HeadList). I'm going to try copying from the context manager implementation of {device}_local_force_arrays. Any advice there would be appreciated.

Motivation and context

Resolves #1259

How has this been tested?

I intend to implement a few additional pytests in the md module.

  • [ ] Test behavior before and after attaching/detaching nlist from simulation
  • [ ] Test pair_list correctness and API usage
    • [ ] Specified default_r_cut, no other nlist consumers
    • [ ] default_r_cut + additional nlist consumers with different r_cuts
    • [ ] Manually specified r_cut after nlist instantiation
  • [ ] Test logging pair list to file

I haven't modified docs yet, but once the Python API is fixed I'll do so.

  • [ ] Update docs to reflect pair_list API

Change log

  • [ ] Propose change log entry when API finalized

Checklist:

ianrgraham • Sep 07 '22

Here are my thoughts. All points are open to further discussion.

The pair list at the current timestep is accessed with the pair_list property, which is loggable (untested). Does this API look suitable?

Yes, this is a good API. I question whether pair_list should be loggable. It comes down to whether pair_list is a rank-local pair list with particles by id or a global pair list with particles by tag. More on this below.
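
For reference, if the global, tag-based variant is what we expose, logging would presumably follow the usual pattern (a sketch only; it assumes pair_list ends up registered as a loggable quantity):

import hoomd

logger = hoomd.logging.Logger()
logger.add(nlist, quantities=['pair_list'])
# The logger would then be attached to a writer (e.g. GSD) as usual.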

The C++ impl of getPairListPython updates the nlist, loops through all neighbors (unfiltered, including neighbors within r_cut + buffer), and returns a vector of unsigned int pairs. Here I'm curious what the behavior should be when multiple nlist consumers are registered to the NeighborList. Should it select neighbors based upon rcut_max, rcut_base, or just leave the filtering job up to the caller?

The filtering should be left to the caller. Reason: to support consumer classes that use the Python API. That consumer needs to access the neighbors within the r_cut it requests, which is contained within the combined maximum r_cut + buffer that NeighborList captures.
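
For example, a Python consumer with a smaller cutoff could do the distance filtering itself. A minimal NumPy sketch, assuming an (n_pair, 2) index array, rank-local positions, and an orthorhombic box with no tilt:

import numpy as np

def filter_pairs(pair_list, positions, box_lengths, r_cut):
    """Keep only pairs separated by less than r_cut (minimum image convention)."""
    dx = positions[pair_list[:, 0]] - positions[pair_list[:, 1]]
    dx -= np.round(dx / box_lengths) * box_lengths
    r_sq = np.sum(dx * dx, axis=1)
    return pair_list[r_sq < r_cut * r_cut]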

Should there be some logic to handle multiple cases?

We could, but the complexity is only necessary if we have some strong use-cases for it. The most general API would be to provide a pair list given a method that accepts the r_cut matrix as a parameter:

pair_list = nlist.compute_pair_list(r_cut_matrix)

However, this API breaks many of the optimization patterns present in NeighborList, such as reuse when particles move less than 1/2 the buffer distance, and it requires an additional pass over the particle positions to filter.

Should we cache the pair list somewhere if the same case is encountered at the same timestep?

We could. Some classes in HOOMD employ this pattern while others do not. The pair list generation is expensive enough that it is certainly worth implementing the cache.

Is an SOA return type preferred to AOS?

The consensus in the #1259 discussion is to store pair_list in a dense array with dimensions [2, n_pair] so that memory operations in GPU threads over pairs perform contiguous memory accesses. CPU caches are large enough that the layout will not change performance noticeably.
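
As an illustration of that layout (values are made up):

import numpy as np

# Row 0 holds the i indices, row 1 the j indices; each row is contiguous in memory.
pair_list = np.array([[0, 0, 1, 2],
                      [1, 2, 3, 3]], dtype=np.uint32)

i_idx, j_idx = pair_list  # unpack the two rows
# A GPU thread assigned to pair k reads pair_list[0, k] and pair_list[1, k], so
# adjacent threads touch adjacent elements of each row (coalesced loads).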

At the moment the pair list is respecting the current StorageMode of the nlist. Should it always yield the "full" or "half" nlist instead?

The full mode exists to enable HOOMD's no-communication pair-potential parallelization scheme (the i <-> j and j <-> i forces are computed separately in parallel). If we expect users of pair_list to also use such a scheme, we could provide the option. Otherwise, I think most users (especially CPU users) will want the half mode: storing only pairs where i < j.
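
If only the half list is stored, a user who wants both directions can reconstruct the full list cheaply. A small sketch using the [2, n_pair] layout from above:

import numpy as np

def half_to_full(half_pairs):
    """Expand a half pair list (i < j only) into one containing both (i, j) and (j, i)."""
    return np.concatenate([half_pairs, half_pairs[::-1]], axis=1)

half = np.array([[0, 0, 1],
                 [1, 2, 3]], dtype=np.uint32)
full = half_to_full(half)  # shape (2, 6)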

I'd also like to implement a zero-copy accessor to the 3 primary neighbor list arrays (NNeighArray, NListArray, and HeadList). I'm going to try copying from the context manager implementation of {device}_local_force_arrays. Any advice there would be appreciated.

My thoughts are that these are part of the internal implementation details of HOOMD and subject to change. With a short GPU kernel followed by a thrust stream compaction, we can generate a local pair_list (with a stable API) on the GPU for a minimal cost. The generic pair_list data structure can be used in numpy, cupy, and other applications and could be accessed directly with local arrays. This would make pair_list local to the rank.

Would that fit your use-case? Or do you need to remove the pair_list generation cost, and/or do you have a use-case that works best with HOOMD's convoluted internal storage format?

Circling back to the MPI question, the pair_list I propose would be local and store particles by index. This would not be suitable for logging. If we wanted to provide an interface that more closely matches the "global" and "local" snapshot concept, we could name this pair_list_local and also provide a global pair_list that collects the local list from all ranks, merges them, and converts particle indices to tags. This global list would be loggable.

joaander • Sep 07 '22

Just some more thoughts to add. For this to be useful in Python, we would need the value returned to be a NumPy array. It would be great if we could do this without copying as well; https://stackoverflow.com/questions/54876346/pybind11-and-stdvector-how-to-free-data-using-capsules has examples of this. However, this would only work on the CPU. We could create a C++ class that represented a GPU buffer using __cuda_array_interface__ as we already do in our local access classes to avoid a copy on the GPU as well.

b-butler • Sep 07 '22

Just some more thoughts to add. For this to be useful in Python, we would need the value returned to be a NumPy array. It would be great if we could do this without copying as well; https://stackoverflow.com/questions/54876346/pybind11-and-stdvector-how-to-free-data-using-capsules has examples of this. However, this would only work on the CPU. We could create a C++ class that represented a GPU buffer using __cuda_array_interface__ as we already do in our local access classes to avoid a copy on the GPU as well.

That all sounds good to me! I think it's just a question of how best to separate the various access patterns to meet the most common use cases. I think primarily there is the goal of handing the user both rank-local and global versions of a numpy pair list (yea my impl with std::pair doesn't do that, but it's an easy fix following some examples) that's only intended to be accessed from the CPU. Like @joaander suggests, this could be pair_list and pair_list_local (or global_pair_list and local_pair_list), and it basically helps users prototype force computes (or even nlist-based analyzers) in Python with little fuss.

Then I think there is the business of providing more low level access to the internal arrays for users that want to use cython, numba, numba.cuda, or even pyo3 to write more performant implementations. I think there it's best to just hand over an __array_interface__ or __cuda_array_interface__ wrapper around the internal representations using a context manager. Maybe call these cpu_local_nlist_arrays and gpu_local_nlist_arrays?
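
To sketch what that low-level access might look like (the property and field names here are placeholders modeled on cpu_local_snapshot; nothing is implemented yet):

# Hypothetical read-only views of the internal buffers, valid only inside the with block.
with nlist.cpu_local_nlist_arrays as data:
    head_list = data.head_list  # per-particle offset into the flat neighbor index array
    n_neigh = data.n_neigh      # number of neighbors of each local particle
    flat_nlist = data.nlist     # flat (ragged) array of neighbor indices

    # Neighbors of local particle i in the ragged layout:
    i = 0
    neighbors_of_i = flat_nlist[head_list[i]:head_list[i] + n_neigh[i]]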

Also on the GPU, do you think it is a good idea to even offer a pair list API? I get the impression that the 3-array internal representation was designed so that it's easy to reduce the forces on a single particle, so as to avoid atomics? If so, it's probably best to encourage the native pattern on GPUs?

ianrgraham • Sep 07 '22

I would like to avoid the context manager approach unless we are committing to creating a new member variable for the neighbor list that stores the pairs. Even then, I would want to avoid this. I would rather users be responsible for caching in Python. The context manager approach is most useful when you want to be able to mutate or read an existent buffer in Python. It has its flaws though. We cannot prevent users from SEGFAULTing on the GPU for instance.
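
Caching on the Python side could be as simple as memoizing on the timestep, e.g. (a sketch, assuming pair_list hands back a fresh array on each access):

class CachedPairList:
    """Re-query the neighbor list only when the simulation has advanced."""

    def __init__(self, sim, nlist):
        self._sim = sim
        self._nlist = nlist
        self._timestep = None
        self._pairs = None

    @property
    def pairs(self):
        if self._timestep != self._sim.timestep:
            self._pairs = self._nlist.pair_list
            self._timestep = self._sim.timestep
        return self._pairs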

b-butler • Sep 07 '22

Oh I don't think there should be a context manager for the pair list. I'm suggesting that it might be a good idea to also expose these arrays using a context manager for very low-level access, basically copying the cpu_local_snapshot and cpu_local_force_arrays APIs. https://github.com/glotzerlab/hoomd-blue/blob/6a508701a09f0471723c8ed2c8392f52a0eecf35/hoomd/md/NeighborList.h#L516-L517 https://github.com/glotzerlab/hoomd-blue/blob/6a508701a09f0471723c8ed2c8392f52a0eecf35/hoomd/md/NeighborList.h#L522 This would be separate to the pair list.

ianrgraham • Sep 07 '22

Just looking to get some feedback on the low-level idea, but I can separate it into another pull request if folks are interested in that feature.

ianrgraham • Sep 07 '22

Like @joaander suggests, this could be pair_list and pair_list_local (or global_pair_list and local_pair_list), and it basically helps users prototype force computes (or even nlist-based analyzers) in Python with little fuss.

To be consistent with how we name this in other places, it should be:

  • pair_list - global pair list, like State.get_snapshot and md.Force.forces.
  • local_pair_list - rank-local pair list; no direct analog, but like State.cpu_local_snapshot.
  • cpu_local_* and gpu_local_* for direct access, like State.?pu_local_snapshot (though @b-butler recommends not using the context manager approach?). Unless we have better names for these.

Also on the GPU, do you think it is a good idea to even offer a pair list API? I get the impression that the 3-array internal representation was designed so that it's easy to reduce the forces on a single particle, so as to avoid atomics? If so, it's probably best to encourage the native pattern on GPUs?

I don't know whether users would find a pair list API useful on GPUs. HOOMD doesn't use one internally; as you say, the internal format parallelizes the computation and maximizes bandwidth when executing multiple threads per particle. I've had very little feedback from users on this topic. I know one application (HOOMD-TF) implements a C++ plugin to convert the neighborlist to a different data structure suitable for use in TensorFlow. That's why I say - if you think this is valuable and you will use it, feel free to expose it (read only) to users. The only downsides are 1) We need to document the ragged sparse array format and 2) Any internal changes necessitate a major version bump. Neither of these is a major problem.

joaander • Sep 07 '22

Yea no, I don't think a GPU pair list is a good idea, just confirming with the rest of y'all. I think the useful GPU feature is to just expose the internal arrays, readonly.

ianrgraham • Sep 07 '22

@joaander and @ianrgraham To clarify, I am not advocating against the local access model where it applies, just against using it for computed quantities like pair_list and local_pair_list, where the easier and safer memory model is to just return the std::vector or to use pybind11::capsule. For exposing the neighborlist buffers the local access model will need to be used.

b-butler • Sep 08 '22

I also agree that if the 3 buffers are useful to you in Python, then exposing them using the local access API is fine, and we can bump the major version if we need to change the buffers.

b-butler • Sep 08 '22

@joaander and @ianrgraham To clarify, I am not advocating against the local access model where it applies, just against using it for computed quantities like pair_list and local_pair_list, where the easier and safer memory model is to just return the std::vector or to use pybind11::capsule.

Are these methods an alternative to the one used here:

https://github.com/glotzerlab/hoomd-blue/blob/6a508701a09f0471723c8ed2c8392f52a0eecf35/hoomd/ParticleData.cc#L3822-L3832 ?

joaander • Sep 08 '22

Are these methods an alternative to the one used here

I'm pretty sure it boils down to who is intended to have ownership of the data at the end of the function call. In the capsule example, the caller will gain ownership, and once the array is deleted on their side the capsule callback will delete the data. In the pybind11::array example without a capsule, I'm pretty sure the callee (the SnapshotParticleData object) still retains ownership, and the numpy array is basically just a reference.

ianrgraham • Sep 08 '22

So yea, @b-butler is right: for the pair list we would want to use a capsule if we're not caching anything on the C++ side. But it just occurred to me that if one's using a decent buffer, the pair list could be unchanged over many timesteps. It might be worth using shared pointers so that the C++ side can hold onto a reference and keep cheaply handing it back to Python while nothing has changed; whenever the nlist is updated, we just allocate a new pair list and the old one gets cleaned up by the shared pointers.

ianrgraham • Sep 08 '22

@joaander The method you show, as @ianrgraham says, implies that data ownership lies with the snapshot object, and the owning object (i.e. the snapshot) promises to keep the pointer alive until the NumPy array dies. Looking at it now, this may actually lead to segfaults. A pathological case would be

snap = sim.state.get_snapshot()
pos = snap.particles.position
del snap
# Access deallocated memory
pos[0]

The py::capsule approach gives ownership to NumPy/Python.

Also @ianrgraham, I think passing a shared_ptr<Scalar[]> to py::capsule would accomplish what you are talking about.

b-butler • Sep 08 '22

Passing self as the base handle in that call ties the lifetime of the numpy array to the snapshot object, so the snapshot object will not be deleted until all references to the internal arrays are also deleted. This allows us to prevent costly memory allocations and multiple copies of data.

I think the pathological case comes in when C++ resizes the arrays, deallocating the old pointers in the process. Any existing numpy arrays holding those pointers would then be invalid.

snap = sim.state.get_snapshot()
pos = snap.particles.position
snap.particles.N = 10
# Access deallocated memory
pos[0]

Would the capsule approach avoid this case while still allowing for zero-copy access in non-pathological cases?

joaander • Sep 08 '22

Ah, @joaander I missed the handle for ownership, and yes using py::capsule with std::shared_ptr<type[]> would solve this. This is what freud does.

b-butler • Sep 08 '22

@joaander ahh, I didn't realize that it was using this impl of array. That's pretty cool, though I don't totally understand what pybind is doing with the base handle to tie the lifetimes.

And @b-butler, I'll use the capsule mechanism to pass the pair list to python then. Thanks for the suggestion!

I should be able to turn this around fairly soon.

ianrgraham • Sep 08 '22

Going to run the suite of pytests (hopefully once 🤞) before I write the new pytests, then it should be ready for review.

ianrgraham • Sep 28 '22

Going to run the suite of pytests (hopefully once 🤞) before I write the new pytests, then it should be ready for review.

It appears you need a slightly larger system (increase the n argument to lattice_snapshot_factory) so that MPI tests will pass: https://github.com/glotzerlab/hoomd-blue/actions/runs/3145718687/jobs/5113983775#step:7:5079

joaander • Sep 29 '22

Yea I noticed that issue, but there is also something else breaking when there is domain decomposition with the global pair list (the rank local stuff is working). When I run the pytests with 2 ranks, I keep getting that the second MPI rank is aborting, but I can't figure out how to debug it. Do you have any tips? Trying the naive thing of writing print statements and passing the -s argument to pytest to print standard output seems to get ignored.

ianrgraham • Sep 29 '22

Also the global pair list is returning more neighbors with MPI than without. My guess is because I need to remove additional copies of pairs across domains? Not totally sure. Though I'll commit what I currently have if someone doesn't mind taking a peek at the global pair list code and tests.

ianrgraham • Sep 29 '22

Yea I noticed that issue, but there is also something else breaking when there is domain decomposition with the global pair list (the rank local stuff is working). When I run the pytests with 2 ranks, I keep getting that the second MPI rank is aborting, but I can't figure out how to debug it. Do you have any tips? Trying the naive thing of writing print statements and passing the -s argument to pytest to print standard output seems to get ignored.

You can run pytest with the --capture=no argument to show all stdout. Run just one test file at a time with this to make the output easier to parse:

pytest -v -x -ra --capture=no hoomd/md/pytest/some_test.py

Also the global pair list is returning more neighbors with MPI than without. My guess is because I need to remove additional copies of pairs across domains? Not totally sure. Though I'll commit what I currently have if someone doesn't mind taking a peek at the global pair list code and tests.

I'm busy this week but can hopefully take a look next week. It can be tricky to get MPI communication working properly. I haven't looked at your code yet, but my initial thought would be to filter the ghost particles out of the local pair list, then convert from ids to tags, then perform an MPI gather operation.
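
A conceptual sketch of that recipe with mpi4py (the local pair list, tag array, and owned-particle count are stand-ins for whatever accessors the PR ends up exposing; the final unique pass removes pairs that show up on more than one rank):

import numpy as np
from mpi4py import MPI

def gather_global_pairs(local_pairs, tags, n_local, comm=MPI.COMM_WORLD):
    """Merge rank-local pair lists into a global, tag-based list on rank 0.

    local_pairs: (n_pair, 2) rank-local particle indices (the second index may be a ghost)
    tags: array mapping local index -> global tag, covering owned and ghost particles
    n_local: number of particles owned by this rank
    """
    # Drop pairs whose first particle is a ghost (if any); cross-boundary pairs are still
    # covered because each rank lists the neighbors of the particles it owns.
    owned = local_pairs[:, 0] < n_local
    tag_pairs = np.sort(tags[local_pairs[owned]], axis=1)  # canonical (low, high) order
    gathered = comm.gather(tag_pairs, root=0)
    if comm.rank == 0:
        return np.unique(np.concatenate(gathered, axis=0), axis=0)
    return None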

joaander • Sep 29 '22

Solved the MPI issue; it was just a pybind return type issue when not on rank 0. But now I'm scratching my head a bit over ghost particles. Is it normal for the data at ghost particle indices to be junk at times? I'm finding that with domain decomposition, the position information for the ghost particles is just wrong, such that |pos_i - pos_j| < r_cut and these pairs are incorrectly being added to the list.

ianrgraham • Sep 29 '22

Is it normal for the data at ghost particle indices to be junk at times? I'm finding that with domain decomposition, the position information for the ghost particles is just wrong, such that |pos_i - pos_j| < r_cut and these pairs are incorrectly being added to the list.

No, it is not normal. The call to m_comm->communicate(timestep); should communicate the ghost particle positions between ranks. If this is not called, then there will be no ghost particles. There may be other corner cases.

joaander • Sep 29 '22

I think I figured it out, the neighbor list itself wasn't setting the comm_flag::position. It looks like it was assumed that if you have a neighbor list there must be something else in the simulation that is requesting positions, which was true before these changes.

There are still a few ghost neighbors that are passing through with incorrect positions, but setting the flag fixed a lot of them.

ianrgraham • Sep 29 '22

I think I figured it out, the neighbor list itself wasn't setting the comm_flag::position. It looks like it was assumed that if you have a neighbor list there must be something else in the simulation that is requesting positions, which was true before these changes.

Good find! It appears that this flag is set by ForceCompute in the current codebase. Yes, it needs to be set by NeighborList as well.

There are still a few ghost neighbors that are passing through with incorrect positions, but setting the flag fixed a lot of them.

Do you know why? Can you post a minimal script that reproduces this? This shouldn't be occurring, and I'd be happy to look into it further.

The current test failure appears to be caused by accessing pair_list on non-root ranks: https://github.com/glotzerlab/hoomd-blue/actions/runs/3154493032/jobs/5132712878#step:7:4944 Just wrap that in an if device.communicator.rank == 0: check.
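
That is, something along these lines (assuming the global list is only gathered on the root rank):

if sim.device.communicator.rank == 0:
    pairs = nlist.pair_list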

joaander • Oct 04 '22

Do you know why? Can you post a minimal script that reproduces this? This shouldn't be occurring, and I'd be happy to look into it further.

So it was a bit of a false alarm. I wasn't aware that the communicator would sometimes wrap positions of ghost particles if they were near the boundary of the global box. So sometimes I was seeing that, with a cubic box with L=7, a particle at [0,0,3] in its own rank (and the global snapshot) is viewed as a ghost particle at [0,0,-4] in the neighboring domain across the periodic boundary.

The big problem that I'm having is something I noticed even outside of MPI, and I can't explain why it's happening. When I create a neighborlist and assign it r_cuts, the nlist works as expected. If I create a neighborlist and don't give it an r_cut, I'll get zero neighbors as expected. If I take this r_cut=0 nlist and add an nlist consumer like a pair potential with an r_cut that should give me the same neighbors as before, I surprisingly get no neighbors. I was trying to debug this, and I was amazed to see that the r_cut_matrix is being added and then quickly removed from the consumer list. Just now I figured out that this only occurs when I register the neighborlist as a standalone compute in addition to tying it to the pair potential, so something like this:

nlist = hoomd.md.nlist.Cell(buffer=0.0, default_r_cut=0.0)

sim: hoomd.Simulation = simulation_factory(lattice_snapshot_factory())
integrator = hoomd.md.Integrator(0.01)
lj = hoomd.md.pair.LJ(nlist, default_r_cut=1.1)  # cubic lattice has neighbors at a spacing of a=1.0
lj.params[('A', 'A')] = dict(epsilon=1.0, sigma=1.0)
integrator.forces.append(lj)
integrator.methods.append(hoomd.md.methods.NVE(hoomd.filter.All()))
sim.operations.integrator = integrator

sim.operations.computes.append(nlist)  # if I comment this out, I will get the correct pair list
# ^ here also seems to be where the lj pair r_cut_matrix is randomly removed from the nlist

sim.run(0)

local_pair_list = nlist.local_pair_list  # pair list is empty

I think I can figure this out with a bit more time, but there's clearly something odd about how the nlist object gets attached to the simulation.

Edit: Or should this just be an error condition?

ianrgraham • Oct 05 '22

Or should this just be an error condition?

No, I would think that this use-case should be supported. @b-butler are you aware of any issues with objects both appearing in Operations.computes and also being dependents of other operations? It seems from this description that the nlist is being attached twice, or something strange is happening. This has implications for your work on #1413 as well.

joaander • Oct 05 '22

We used to have a dependency on the integrator with the Clusters object to enable integrator switching. It should be possible, but yes currently it calls attach twice here, https://github.com/glotzerlab/hoomd-blue/blob/c1b95d29a7e41a16542bdf36e8ee8cdb1f687e5d/hoomd/data/syncedlist.py#L191-L202

We could change this to check if an object is attached first. This should be safe assuming the adding code works as intended.
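
Something along the lines of the following guard (a sketch, not the actual syncedlist code; it leans on the existing _attached/_attach conventions):

def _attach_value(self, value):
    # Skip values that are already attached through another operation, e.g. an nlist
    # that sits in Operations.computes and is also used by a pair potential.
    if not value._attached:
        value._attach()
    return value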

Edit: I added the necessary line to #1413.

b-butler • Oct 05 '22

Cool! I'll test to make sure that change fixes it.

ianrgraham • Oct 05 '22