Slow execution when checking for the same package
https://github.com/Unstructured-IO/unstructured/blob/01dbc7b4733e88efd6c1e85930c707009a2a966e/unstructured/nlp/tokenize.py#L101-L113
This should probably use the cache here instead of on the tokenizers:
@lru_cache(maxsize=CACHE_MAX_SIZE)
def check_for_nltk_package(package_name: str, package_category: str) -> bool:
    # Cache the result to avoid the OS directory checks in find (find is very expensive)
    nltk.find(f"{package_category}/{package_name}", paths=paths)
Hi @ffma-nate-rogan - good suggestions. We'll take a look at that.
@ffma-nate-rogan are you seeing an actual performance bottleneck here, or are you proposing this on principle?
By my reading, check_for_nltk_package() is called at most once with each set of parameters, both times from _download_nltk_packages_if_not_present(), which is itself cached (hence the "at-most-once" claim).
Closing for now as unnecessary, but happy to reconsider if you think I've missed something.
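The at-most-once reasoning above can be illustrated with a small sketch. The function names and package names here are placeholders chosen for the example, not the library's actual code; the only assumption carried over from the thread is that the caller is wrapped in a cache.

```python
from functools import lru_cache

inner_calls = []  # records every real invocation of the inner check


def check(package_name: str, package_category: str) -> bool:
    # Placeholder for the uncached package check.
    inner_calls.append((package_category, package_name))
    return True


@lru_cache()
def download_if_not_present() -> None:
    # Placeholder for the cached caller: it checks two distinct packages.
    check("tagger_pkg", "taggers")
    check("tokenizer_pkg", "tokenizers")


# However often the cached caller is invoked, its body runs only once,
# so each inner check runs once per distinct parameter set.
for _ in range(3):
    download_if_not_present()

print(len(inner_calls))
```

Under this arrangement, adding a second cache on the inner check would only help if some other, uncached code path also called it.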