Refactor particleset attributes and particle data for vectorized kernels
Following from #2122, it would be good to rethink the API and refactor the code for particleset data. See below the original comment by @veckothegecko
Hmm, I assume that this change was needed since we are passing the particleset all the way to the kernels now? I think we should at some point pass the particle data itself down - as I touched on f2f, there is a distinction between the particle data and the ParticleSet (e.g., it doesn't make sense to call `pset.execute()` in a kernel). Passing the particle data down to the kernel makes this distinction clear. It also means that we don't silently limit users' variable names (currently they can't name a variable `execute` or anything else that is already a method/attr on ParticleSet).
Happy for that to be a future issue.
Perhaps:

- Make ParticleData its own class (dict of arrays under the hood, but allowing `.attr` notation to get and set values)
  - Only have private variables/methods/names on this dataset (i.e., `_data`). All public attribute access will be assumed to correspond with particle data -> this results in no shadowing
  - Disallow names starting with `_` in `parcels/particle.py` `Variable` as this could collide with ParticleData
- Pass ParticleData down to the kernels
- Update `pset._data` to be `pset.data` so that it's on our public API.
  - @erikvansebille can we then remove the `__getattr__` and `__setattr__` methods on the `pset`? I don't think it's too prohibitive to do `pset.data.lon` if users want to access their lon values, or would you prefer to still have the `pset.lon`? (a getattr? or also a setattr?)
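To make the proposal concrete, here is a minimal sketch of what such a class could look like (the class name `ParticleData` follows the proposal above; the exact API is hypothetical, not the Parcels implementation):

```python
import numpy as np


class ParticleData:
    """Hypothetical dict-of-arrays container with .attr access (sketch of the proposal above)."""

    def __init__(self, **arrays):
        # single private attribute; every public name is assumed to be particle data
        object.__setattr__(self, "_data", {k: np.asarray(v) for k, v in arrays.items()})

    def __getattr__(self, name):
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        if name.startswith("_"):
            # reserved for internals, per the "disallow names starting with _" point above
            raise AttributeError(f"names starting with '_' are reserved: {name!r}")
        self._data[name] = np.asarray(value)


pd = ParticleData(lon=[0.0, 1.0], lat=[2.0, 3.0])
pd.lon = pd.lon + 10.0  # attribute get and set both route through the dict of arrays
```

Because the only non-data attribute is private, public attribute access never shadows a variable name.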
Also, thoughts on explicitly disallowing `object.__setattr__(self, name, value)` in that last else clause? I know that it's there to maintain the default behaviour for Python objects, but I don't see the benefit in supporting it here. Since it works outside of our model, that data would fall through the cracks if we write particle data, and it also means that if people typo in their kernel it will be a silent bug.
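The silent-bug concern can be illustrated with a small sketch (class and variable names hypothetical): replacing the `object.__setattr__` fallback with a raise turns a typo'd variable name in a kernel into an immediate error instead of a plain Python attribute that never reaches the particle data.

```python
import numpy as np


class StrictParticleView:
    """Hypothetical view that rejects unknown attribute writes instead of silently accepting them."""

    def __init__(self, data, index):
        object.__setattr__(self, "_data", data)
        object.__setattr__(self, "_index", index)

    def __getattr__(self, name):
        return self._data[name][self._index]

    def __setattr__(self, name, value):
        if name in self._data:
            self._data[name][self._index] = value
        else:
            # no object.__setattr__ fallback: unknown names are an error, not a silent no-op
            raise AttributeError(f"unknown particle variable: {name!r}")


data = {"lon": np.zeros(3)}
view = StrictParticleView(data, np.array([True, False, True]))
view.lon = 1.0  # fine: 'lon' is a known variable, write goes through the mask
try:
    view.lonn = 1.0  # typo -> raises instead of silently creating a Python attribute
    raised = False
except AttributeError:
    raised = True
```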
Originally posted by @VeckoTheGecko in https://github.com/OceanParcels/Parcels/pull/2122#discussion_r2263531738
Note that with the change in https://github.com/OceanParcels/Parcels/pull/2122/commits/bee54642db18a0f4cbb08905bf470638527ed754, we now pass a list of KernelParticles to the Kernels, instead of the entire ParticleSet: https://github.com/OceanParcels/Parcels/blob/081753dbce1ea327d1c2ed6fcc147d81183caa10/parcels/kernel.py#L255-L258
So this might solve (some) of the problems above?
we now pass a list of KernelParticles to the Kernels, instead of the entire ParticleSet
*not a list of KernelParticle instances, but a single KernelParticle instance where the index is a boolean mask (instead of individual indices - 1, 2, 3 etc. - as it was before)
Yes, KernelParticle in #2122 is helping with the problems above. But naming is now a bit confusing as it isn't about a single particle, and would be good to formalise this properly in a particle data class with proper testing. For a future PR - I can work on that once I'm back (also fielding some interesting ideas from people here, which has been helpful)
Another aspect to consider is the implementation of __getitem__:
https://github.com/OceanParcels/Parcels/blob/d4fdc01e9913db51dd18e4eb1afcf2e6c8250b02/parcels/particle.py#L141-L143
See also https://github.com/OceanParcels/Parcels/pull/2122#discussion_r2285507966
Note that https://github.com/OceanParcels/Parcels/pull/2199#issuecomment-3285145938 now has a minimal breaking example
@erikvansebille is this the issue most relevant to the KernelParticle discussion? Also, I feel like there was another issue with test cases (repeatedly assigning to variables and that causing broken state) but I can't seem to find it
I feel like there was another issue with test cases (repeatedly assigning to variables and that causing broken state)
Found it https://github.com/Parcels-code/Parcels/pull/2199#issuecomment-3285278876
I think this problem runs deeper than a simple refactoring issue. I have reached out to others to ask what they think - hence I have written this comment in a way that assumes zero prior knowledge about Parcels internals (or about what's discussed above in this issue).
Parcels is a simulation framework that integrates particle trajectories in flow fields by repeatedly updating their positions in a loop based on "Kernels" (i.e., functions that define this custom behaviour - they can be provided by the package or written by users).
When running a simulation, we only want to act on particles that are currently within the simulation. Consider, for example, a timed release where particles are gradually released into the simulation: here we only want to update a portion of the particle data.
Here is a diagram showing the layout that we have in mind.
Trying to implement this via a custom KernelParticleData class, however, is proving difficult due to array masking in numpy. I'm not really sure how to proceed, or what would be a good pattern to use. I think on a fundamental level we might also be running into the difference between views and copies.
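That views-versus-copies distinction can be seen in plain NumPy, independent of any Parcels code: a single-step masked assignment writes through to the original array, but boolean indexing on its own returns a copy, so chained masking loses the connection to the original data.

```python
import numpy as np

arr = np.arange(5.0)   # [0., 1., 2., 3., 4.]
mask = arr > 2.0       # [False, False, False, True, True]

# single-step masked assignment calls arr.__setitem__ and writes through
arr[mask] = 0.0        # arr is now [0., 1., 2., 0., 0.]

arr2 = np.arange(5.0)
sub = arr2[arr2 > 2.0]  # boolean indexing returns a *copy*, not a view
sub[:] = 0.0            # modifies only the copy; arr2 is unchanged
```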
Here's a minimal example of the code showing the problem. Keen to have ideas on how we can better implement KernelParticleData (if that is possible).
```python
import numpy as np

CURRENT_TIME = np.datetime64("2024-01-05")  # clock time of simulation - constant just for this example
N = 11
N_IN_SIMULATION = 5


def get_particle_data():
    data = {
        "lon": np.linspace(900.0, 1000.0, N),
        "lat": np.linspace(900.0, 1000.0, N),
        "depth": np.linspace(900.0, 1000.0, N),
        "time": np.concatenate(
            [
                np.repeat(CURRENT_TIME, N_IN_SIMULATION).astype("datetime64[ns]"),
                np.array([np.datetime64(f"2024-01-{d:02}") for d in range(6, 12)]).astype("datetime64[ns]"),
            ]
        ),
    }
    assert all(len(v) == N for v in data.values())
    return data


class KernelParticleData:
    """Simple class to be used in a kernel that links a particle (on the kernel level) to a particle dataset."""

    def __init__(self, data, index):
        self._data = data
        self._index = index

    def __getattr__(self, name):
        return self._data[name][self._index]

    def __setattr__(self, name, value):
        if name in ["_data", "_index"]:
            object.__setattr__(self, name, value)
        else:
            self._data[name][self._index] = value

    def __getitem__(self, index):
        self._index = index
        return self

    def __len__(self):
        return len(self._index)


# =======
# Kernels
# =======
# Behaviour defined by the user to be run on particles that are in the simulation
# `fieldset` provides the flow data - not used in these examples


def kernel1(particles, fieldset):
    assert particles.lon.size == N_IN_SIMULATION
    particles.lon = 0


def kernel2(particles, fieldset):
    assert particles.lon.size == N_IN_SIMULATION
    particles.lon = 0
    # is there a way that I can define subsets of particles, and then update only those subsets?
    # the following code would be great, but I don't know how to support it on the back end :/
    psubset = particles[particles.lon > 950.0]
    assert psubset.lon.size == 0


def main():
    # pset = ParticleSet(fieldset, lon=[0.5], lat=[0.5])
    particle_data = get_particle_data()
    mask_to_evaluate = particle_data["time"] == CURRENT_TIME  # evaluate particles whose clock is the current time

    kernel_particle_data = KernelParticleData(particle_data, mask_to_evaluate)
    kernel1(kernel_particle_data, fieldset=None)
    assert np.all(particle_data["lon"][:N_IN_SIMULATION] == 0)
    assert np.all(particle_data["lon"][N_IN_SIMULATION:] >= 900.0)
    print("kernel1 works!")

    # Reset
    particle_data = get_particle_data()
    kernel_particle_data = KernelParticleData(particle_data, mask_to_evaluate)
    try:
        kernel2(kernel_particle_data, fieldset=None)
    except Exception as e:
        print("kernel2 doesn't work!")
        raise e


main()
```
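The failure in `kernel2` comes down to a shape mismatch: `particles.lon > 950.0` is a boolean mask over the 5-element subset, but `__getitem__` then stores it as `_index` and uses it to index the full 11-element arrays, which NumPy rejects. Stripped of the class machinery, the same error reproduces in a few lines (variable names are illustrative):

```python
import numpy as np

full = np.arange(11.0)
outer_mask = full < 5.0            # 11-element mask: 5 particles "in the simulation"
sub_mask = full[outer_mask] > 2.0  # 5-element mask defined on the subset

try:
    full[sub_mask]                 # 5-element boolean mask on an 11-element array
    failed = False
except IndexError:
    failed = True
```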
OK, so I dived deep into this KernelParticle Issue, and I think I fixed it. The minimal example above is actually quite easy to fix:
```diff
 def __getitem__(self, index):
-    self._index = index
-    return self
+    new_index = np.zeros_like(self._index, dtype=bool)
+    new_index[self._index] = index
+    return KernelParticleData(self._data, new_index)
```
This will return an object whose index is True only where the original `_index` was True and the new index is True, thereby subsetting the index as desired!
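The composition step can be checked in isolation: scattering the subset-level mask back through the outer mask yields a new full-length mask that is True only where both are True, so the resulting object still indexes the full-length arrays correctly (the snippet below is a standalone check, not Parcels code).

```python
import numpy as np

full_mask = np.array([True, True, False, True, False])  # particles in the simulation
lon = np.array([900.0, 960.0, 901.0, 970.0, 902.0])

sub_mask = lon[full_mask] > 950.0  # mask defined on the 3-element subset

# compose into a new full-length mask, as in the __getitem__ fix
new_index = np.zeros_like(full_mask, dtype=bool)
new_index[full_mask] = sub_mask    # True only where both masks are True

lon[new_index] = 0.0               # full-length mask, so this writes through correctly
```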
However, also supporting operations like `particles[mask] += val` turned out to be much more complex. It required building an entirely new data class. See #2443 for the implementation.
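For context on why `particles[mask] += val` is harder (this is general Python semantics, not a description of the #2443 implementation): Python desugars an augmented assignment on a subscript into a read, an in-place add, and a write-back, so the class must implement `__setitem__` as well, and the object returned by `__getitem__` must support addition. A small recorder class makes the desugaring visible:

```python
import numpy as np


class Recorder:
    """Logs the calls Python makes for `obj[mask] += val` (illustration only)."""

    def __init__(self):
        self.calls = []
        self.data = np.zeros(3)

    def __getitem__(self, key):
        self.calls.append("__getitem__")
        return self.data[key]

    def __setitem__(self, key, value):
        self.calls.append("__setitem__")
        self.data[key] = value


r = Recorder()
mask = np.array([True, False, True])
r[mask] += 1.0  # desugars to: r[mask] = r[mask].__iadd__(1.0)
```

Note that the intermediate `r[mask]` is a copy (boolean indexing), so the write-back via `__setitem__` is what actually updates `r.data`; without `__setitem__`, the `+=` would either fail or silently modify only the copy.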