kiddo_v1
kiddo_v1 copied to clipboard
A better PBC implementation
This incorporates some of the feedback you gave me on the last attempt. I had made a mess of my branch so I'm here with a new one and a new PR. This one has a much cleaner API and also has some correctness tests with brute force checks. I would like to continue adding a few more functions e.g. best_n_within
, but for now nearest
, nearest_one
, within
and within_unsorted
work with periodic boundary conditions.
I noticed that your check_point
suffered from the same ill so I changed it to allow early returns
fn check_point(&self, point: &[A; K]) -> Result<(), ErrorKind> {
for p in point {
if !p.is_finite(){
return Err(ErrorKind::NonFiniteCoordinate)
}
}
if let Some(periodic) = self.periodic {
for (idx, &p) in point.iter().enumerate() {
if p < A::zero() || p > periodic[idx] {
return Err(ErrorKind::OutOfPeriodicBounds)
}
}
}
Ok(())
}
Also I didn't get to thinking about how to serialize the Option
, which is now an Option<&'a [A; K]>
so that we don't have an identical copy of the array at every KdTree
.
I wrote a benchmark with 1m query points and 100k data points and timed building and querying and got (non-pbc/pbc)
average: 26/26 millis to build, 85/92 millis to query
running on 48 threads. I can include this bench if you'd like. The build time is expected to be exactly the same because in this implementation the build is the same. The PBCs are handled by checking if it is necessary to apply PBCs (e.g. if your 1NN distance > distance to edge of box) and doing additional queries with only those relevant images.
Hiya - just thought I'd drop a quick comment to let you know I've not forgotten about this PR! I was exploring ways to see how it would fit in with a potential change to the library that I'm experimenting with that will allow users to choose from more than one underlying implementation that will all implement a KdTree
trait. This way, we could have a Standard, Periodic, and an upcoming Index based internal representation that all expose the same trait, allowing people to switch between them.
This is proving to be quite a significant undertaking, so I've decided to get your changes in first and move back on to the Trait based approach afterwards.
I've got a few small changes I'd like to propose to your PBC PR though. I'll share them over the next day or two as soon as I've got all the tests passing, but the main aspects are:
-
removing
periodic: Option<&'a [A; K]>
from theKdTree
struct. Since thisperiodic
array is a reference to an array that exists outside of theKdTree
struct, we can instead keep it out of the struct and just pass it as an argument tonearest_one_periodic
and all of the other periodic-specific functions. This has a couple of advantages: The memory layout of theKdTree
struct does not increase in size, keeping the current memory efficiency for users that aren't using PBC. Additionally, for users that are working with serialized trees that are loaded in from the filesystem, since the memory layout is the same they won't need to rebuild all their serialized datastructures when upgrading their version of Kiddo to one that includes these changes. (This can be quite a significant overhead - I'm working with a serialized tree containing tens of millions of items that would take me quite some time to recreate). -
The test cases for PBC have quite complex code to check that the PBC behaviour works correctly. I've added a simpler, manually constructed test case where the expected results can be easily determined by simple manual calculations. I like your more heavyweight tests too, they're definitely a great addition, but especially for low level data structure libraries like this, it is good to have some more fundamental tests where the correctness of the library can be verified by hand, as a bit of an underlying sanity check.
Hopefully we can get this merged soon! Thanks again :-)
Hi!
I've opened a second PR for this which contains all your commits from this PR plus one more of my own. I made a few modifications:
- feat: PBC now expects the period as an argument when querying using PBC functions, rather than storing it on each node. This keeps the node size down and prevents the memory layout from changing with the introduction of PBC, which would force existing users to have to rebuild serialized trees if they wanted to upgrade.
- tests: PBC tests refactored to move common brute force PBC check code and data/query/tree generating into separate function
- tests: PBC tests use neater float asserts from approx package
- tests: Rayon removed and PBC unit test query point and data point count reduced to improve unit test run time
- style: re-ran cargo fmt and cargo clippy and fixed clippy lints that were introduced
Any chance you could take a look and see what you think?
Thanks!
Hi!
I've opened a second PR for this which contains all your commits from this PR plus one more of my own. I made a few modifications:
- feat: PBC now expects the period as an argument when querying using PBC functions, rather than storing it on each node. This keeps the node size down and prevents the memory layout from changing with the introduction of PBC, which would force existing users to have to rebuild serialized trees if they wanted to upgrade.
- tests: PBC tests refactored to move common brute force PBC check code and data/query/tree generating into separate function
- tests: PBC tests use neater float asserts from approx package
- tests: Rayon removed and PBC unit test query point and data point count reduced to improve unit test run time
- style: re-ran cargo fmt and cargo clippy and fixed clippy lints that were introduced
Any chance you could take a look and see what you think?
Thanks!
These all sound like great changes. Curious to see the performance. I'll check it out very soon.