pykdtree
pykdtree breaks down for a very large number of data points
I have been using pykdtree to obtain nearest neighbours and it seems that it breaks down for a very large dataset. I managed to reproduce the problem with the following example:
from pykdtree.kdtree import KDTree
import numpy as np
pos = np.random.rand(int(5e8),3)
nb = 32
tree = KDTree(pos)
d, idx = tree.query(pos, k=nb)
h = d[:,nb-1]
print(np.min(h))
The code above reports that the minimum distance from a particle to its 32nd nearest neighbour is zero, which is incorrect and, for uniformly random points, extremely unlikely. It turns out that zero is assigned to many more than just one particle. In fact, it is zero for a large fraction of the particles. Doing
import numpy as np
k, = np.where(h == 0)
print(len(k))
returns 365782272, i.e. the distance is 0 for ~73% of the whole sample. This is clearly the wrong answer.
I discovered the problem when using py-sphviewer, which relies on pykdtree to find the smoothing length of particles in cosmological simulations. When the number of particles within the simulated volume is very large (several hundred million), pykdtree assigns a wrong distance of 0 between individual particles and their 32nd neighbours.
Any idea what might be causing this weird behaviour? I checked with both single and double precision.
It's a pointer arithmetic overflow problem: https://github.com/storpipfugl/pykdtree/blob/master/pykdtree/_kdtree_core.c#L1410
I'll look into giving pykdtree an overhaul to support contemporary data set sizes.
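The numbers reported in the issue appear consistent with a 32-bit overflow. The sketch below is only an arithmetic check, not code from pykdtree. It assumes the result arrays are indexed with a flat offset `i * k` held in a 32-bit unsigned integer, and `wrapped_offset` is a hypothetical helper that models that wrap-around. The issue does not say which particles got a zero distance, so treating them as the ones past index 2**27 is also an assumption.

```python
# Numbers taken from the issue text above.
n_points = int(5e8)    # points passed to KDTree and to query()
k = 32                 # neighbours requested per point (nb in the issue)
n_zero = 365782272     # points whose 32nd-neighbour distance came back as 0

n_nonzero = n_points - n_zero
print(n_nonzero)               # 134217728
print(n_nonzero == 2**27)      # True
print(n_nonzero * k == 2**32)  # True


def wrapped_offset(i, k, bits=32):
    """Hypothetical helper: flat offset i*k as an unsigned `bits`-wide integer.

    This only models the arithmetic; it is not code from pykdtree.
    """
    return (i * k) % (2**bits)


last_ok = n_nonzero - 1
print(wrapped_offset(last_ok, k) == last_ok * k)  # True: still fits in 32 bits
print(wrapped_offset(n_nonzero, k))               # 0: wraps back to the start
```

Exactly 2**27 points have a non-zero distance, and 2**27 * 32 = 2**32. That is what one would expect if the first 2**27 rows of the output were written correctly and the offsets for later rows wrapped around.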