NearestNeighbors.jl icon indicating copy to clipboard operation
NearestNeighbors.jl copied to clipboard

Inf points causes corruption with BallTree

Open BioTurboNick opened this issue 7 years ago • 13 comments

To implement an exclusive nearest neighbor algorithm, I 'removed' points from my points array by setting their values to Inf and rebuilding the BallTree from the edited array. I did this to avoid copying the array each time and to preserve indexing.

However, the results output by knn are incorrect and unstable. Once the tree is created, it will always produce the same wrong output. However, regenerating the tree on the same data and trying knn again can produce different wrong output. The indexes are often some absurdly large integer such as 874275200 with an Inf distance. This is despite the fact that there are definitely points with non-infinite values in the tree that should be chosen as nearest neighbors.

The same dataset produces correct and consistent results when using BruteTree and KDTree.

I'll see if I can provide a reproduction. The issue didn't start to occur until ~half the points were Inf.

BioTurboNick avatar Nov 29 '18 23:11 BioTurboNick

Hmm... in another case I observed the calculated nearest neighbor distance vary between runs even though the index was correct. Couldn't get it to reproduce reliably.

However, here's a reproduction of the above:

coords = [29882.5 25974.3 Inf  Inf 17821.8 Inf Inf Inf Inf Inf 16322.0; 9279.86 9286.35 Inf Inf 10320.4 Inf Inf Inf Inf Inf 11459.0; 0.0 0.0 Inf Inf 0.0 Inf Inf Inf Inf Inf 0.0]
point =  [17889.55, 2094.45, 0.0]
tree = BallTree(coords)
knn(tree, point, 1) # returns ([_______], [Inf])

tree = BruteTree(coords)
knn(tree, point, 1) # returns ([5], [8226.23])

tree = KDTree(coords)
knn(tree, point, 1) # returns ([5], [8226.23])

BioTurboNick avatar Nov 30 '18 01:11 BioTurboNick

Seems like something is wrong for BallTree in the presence of Inf?

KristofferC avatar Nov 30 '18 14:11 KristofferC

Just for the record, I realized that the skip argument to the various methods can provide what I was trying to do with setting points to Inf.

BioTurboNick avatar Aug 26 '19 17:08 BioTurboNick

Hello!

I'm experiencing similar problems with BallTrees on perfectly finite data; it seems to happen also when some of the points from which the tree is constructed are either very close together, almost equal or equal.

Should I try to get the exact data that trigger the error? (it's pretty deep now)

Thanks

exaexa avatar Feb 24 '20 10:02 exaexa

Having a small reproducer would make it easier to get to the bottom of the issue here.

KristofferC avatar Feb 24 '20 10:02 KristofferC

OK, will try ASAP (the tooling around is a bit complicated)

exaexa avatar Feb 24 '20 10:02 exaexa

OK, will try ASAP (the tooling around is a bit complicated)

I was trying to reproduce the issue generating random data, but It seems fine so far. Can you at least say the number of points and dimensions you've had a problem with? I noticed before the size of the array can influence the likelihood to trigger the problem.

nlw0 avatar Oct 02 '21 12:10 nlw0

Hi -- sorry, I kindof missed my reminder and now I see it's already a year :D The data was around 30 dimensions, roughly 1000s of points, not more. You might have luck generating the same issue with same or almost-same points, try making a balltree out of a random cloud where all values are divided by 10^9, 10^10, etc.., or so. I'll add this to the TODOs for the next week and will let you know.

exaexa avatar Oct 02 '21 13:10 exaexa

I may have a MWE for this issue:

using NearestNeighbors

using LinearAlgebra: norm
using Random:randperm

Ndim = 1
Npt = 11

data = randn(Ndim, Npt)

pa = randperm(Npt)[1:2]
data[:,pa] .= Inf

display(data)

tree = BallTree(data)
@show point = randn(Ndim)
distances = mapslices(norm, data .- point, dims=1)
gtc = argmin(distances)
@show gt = gtc[2]
@show idx, dist = nn(tree, point)

display([point  data[:,[gt, idx]]])

@assert idx == gt

Example output:

julia> include("/home/user/src/nntest.jl");
1×11 Matrix{Float64}:
 -0.084283  -0.489114  -0.808527  0.245605  -0.0784464  -2.12956  Inf  -0.395751  Inf  -0.24521  -0.549262
point = randn(Ndim) = [-1.1180802696153354]
gt = gtc[2] = 3
(idx, dist) = nn(tree, point) = (6, Inf)
1×3 Matrix{Float64}:
 -1.11808  -0.808527  -2.12956
ERROR: LoadError: AssertionError: idx == gt

A number or dimensions as low as 1 works. The number of points is a little more finicky, but with 11 it's pretty reliable to cause the problem. With 10 or 12 you don't see it. I believe with more points you need more "infs" to cause the issue.

Most of the times either the first point or the last gets picked, but you can get points within the valid range as well, as in the example above.

With a single "Inf" there's no issue, you need at least 2.

nlw0 avatar Oct 02 '21 14:10 nlw0

Interesting, I guess my issue is a bit different then, I'm pretty sure I had no Infs there. Will check and let you know.

exaexa avatar Oct 02 '21 14:10 exaexa

I've looked more into this, but I'm not very knowledgeable of ball trees. I think the issue with Infs is that the model cannot naturally deal with it. If you look into create_bsphere, what need to compute a center and a radius from a set of points, and what ends up happening is that Inf values will throw a center to Inf, but then you cannot compute a radius, so we get NaN. If you look into the tree.hyper_spheres you'll see things are messed up. What should a ball tree model look like with Inf data values? I'm not sure. So maybe it's more of a case of "not a bug"...

Not sure about @exaexa 's report, I couldn't find issues with regular numbers so far, so maybe it's some subtle float arithmetics thing?

nlw0 avatar Oct 03 '21 10:10 nlw0

Could we just be a bit careful so instead of getting a center at Inf with a NaN radius, you get a ball at 0.0 with an Inf radius. Perhaps things can be made to "just work"...

KristofferC avatar Oct 03 '21 11:10 KristofferC

I'm not sure there's a way to make it work naturally... If you think about it, going for a neat solution, moving a single point to infinity should maybe cause the hypersphere to turn into a hyperplane touching the first inline point of the set in the normal direction. With multiple Inf, things start getting complicated, though. And this does not seem to go in line with the idea that making points Inf we are kind of "discarding" them. They may not be returned in the query, for sure, but this seems to be contradict very strongly the assumptions of a BallTree. I'm not sure what can be done, though, should we ignore points with Inf or just leave it up to the user?... We might be able to check when the mean is computed, but perhaps the best would be to perform a first pass looking for points to be discarded.

nlw0 avatar Oct 05 '21 09:10 nlw0