pybktree icon indicating copy to clipboard operation
pybktree copied to clipboard

Fix adding duplicate items

Open Bladieblah opened this issue 3 years ago • 5 comments

There's no check whether an item already exists in the tree, leading to unnecessarily deep trees, this should fix that issue.

Bladieblah avatar Nov 09 '21 11:11 Bladieblah

Thanks. This looks good at first glance, but I don't think it's correct. You can have multiple items with zero distance between them. For example, in the case of image hashing, consider the case when more than one image has the same image hash:


import collections
import pybktree

Image = collections.namedtuple('Image', 'data hash')

def image_distance(a, b):
    return pybktree.hamming_distance(a.hash, b.hash)

image1 = Image('one', 123)
image2 = Image('two', 42)
image3 = Image('three', 123)

t = pybktree.BKTree(image_distance)
t.add(image1)
t.add(image2)
t.add(image3)
print(list(t))
# [Image(data='one', hash=123), Image(data='two', hash=42), Image(data='three', hash=123)]

Iterating over the tree should produce all three images, even though image1 and image3 have the same hash.

What's your use case where this has come up?

benhoyt avatar Nov 09 '21 22:11 benhoyt

Good point, I didn't realise hashes might be used. I was using it to cluster transaction data (textual data), which contain many duplicate samples. I ended with very deep trees, in one case a tree with 27k layers where it should have been 3 (which can still be many nodes of course). I based this change on the wikipedia article explaining the insertion procedure. The article also only considers textual data though, which probably explains the oversight.

In this case it could be solved by replacing the check on distance with a check if the two items are identical, right?

Bladieblah avatar Nov 10 '21 08:11 Bladieblah

I added the check, but also kept the check on distance since that one is probably less expensive, especially in the case of images.

Bladieblah avatar Nov 10 '21 08:11 Bladieblah

Hmm, this does change the behavior (as shown by the failing doctests), because now you can't add more than one of the same item to the tree. This might be more correct, but I'm hesitant to change the behavior without releasing a new major version (which I'm probably not going to do at this point). My suggestion, if you need this, would be to check that you're not adding an equal/identical item with a separate dict.

benhoyt avatar Nov 10 '21 09:11 benhoyt

it's nice if we have a flag to modify the behavior.

c01o avatar May 29 '23 22:05 c01o