neofuzz indexing fails for list of 400K strings

Open SeanPedersen opened this issue 1 year ago • 2 comments

Error message:

python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
zsh: killed     python neofuzztest.py

Code:

import random
import string
from neofuzz import char_ngram_process


def rand_str(length):
    characters = string.ascii_letters + string.digits
    return "".join(random.choice(characters) for _ in range(length))


names = [
    rand_str(8) + " " + rand_str(6) + " " + rand_str(4) + " " + str(i)
    for i in range(400_000)
]
print(len(names))

neofuzz_process = char_ngram_process()
neofuzz_process.index(names)

query = "test 3333"

pre_filter = neofuzz_process.extract(query, limit=2000, refine_levenshtein=True)
print(pre_filter[:10])

The blazing-fast speed of this lib only pays off on large datasets, so indexing needs to work at this scale.

SeanPedersen avatar Oct 26 '24 22:10 SeanPedersen

Hmm, interesting... Thanks for taking the time to look into this. Can I get a full error log? I have a feeling this might have something to do with PyNNDescent.
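A `zsh: killed` line means the process received SIGKILL, which is often (though not necessarily here) the OS reclaiming memory, so there may be no Python traceback to share. One way to gather more detail is to grow the input step by step and record time and peak memory at each size. The sketch below uses only the standard library; `peak_rss_mb` and `profile_scaling` are hypothetical helper names, not part of neofuzz, and `resource` is only available on Unix-like systems.

```python
import resource
import sys
import time


def peak_rss_mb():
    """Peak resident set size of this process so far, in MiB."""
    usage = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    if sys.platform == "darwin":
        return usage / (1024 * 1024)
    return usage / 1024


def profile_scaling(build_index, make_names, sizes):
    """Call build_index on inputs of increasing size.

    build_index: callable taking a list of strings (hypothetical hook).
    make_names:  callable taking n and returning n strings.
    Returns a list of (n, seconds, peak_rss_mib) tuples.
    """
    results = []
    for n in sizes:
        names = make_names(n)
        start = time.perf_counter()
        build_index(names)
        elapsed = time.perf_counter() - start
        row = (n, elapsed, peak_rss_mb())
        results.append(row)
        # Flush so the last completed size is visible even if the
        # process is killed during the next, larger step.
        print(f"n={row[0]:>8}  time={row[1]:.2f}s  peak_rss={row[2]:.0f} MiB", flush=True)
    return results
```

For the script in this issue, `build_index` could be `lambda names: char_ngram_process().index(names)` and `make_names` a function that generates the random names, with sizes such as `[25_000, 50_000, 100_000, 200_000, 400_000]`. The peak value is cumulative over the process lifetime, so it shows the high-water mark reached up to each step rather than the cost of that step alone.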

x-tabdeveloping avatar Oct 27 '24 13:10 x-tabdeveloping

Hey @SeanPedersen! I have updated the library to use Annoy as a backend instead of PyNNDescent. Can you check whether this issue still occurs?

x-tabdeveloping avatar Apr 20 '25 14:04 x-tabdeveloping