unstructured Slow execution when checking for the same package

https://github.com/Unstructured-IO/unstructured/blob/01dbc7b4733e88efd6c1e85930c707009a2a966e/unstructured/nlp/tokenize.py#L101-L113

Should prob use the cache here instead of on tokenizers:

@lru_cache(maxsize=CACHE_MAX_SIZE)
def check_for_nltk_package(package_name: str, package_category: str) -> bool:

Cache the result to avoid the OS directory checks in find (find is very expensive):

nltk.find(f"{package_category}/{package_name}", paths=paths)

Aug 20 '24 23:08 ffma-nate-rogan

Hi @ffma-nate-rogan - good suggestions. We'll take a look at that.

Aug 26 '24 14:08 MthwRobinson

@ffma-nate-rogan are you seeing an actual performance bottleneck on this, or are you proposing this on principle?

By my reading, check_for_nltk_package() is called at most once with each set of parameters, both times from _download_nltk_packages_if_not_present() (which is itself cached, therefore the "at-most-once" claim).
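To illustrate the "at-most-once" reasoning, here is a rough sketch under assumed names. The function bodies and package names are illustrative stand-ins, not copied from the library. Because the outer function is cached, repeated calls never reach the inner check again.

```python
from functools import lru_cache

calls = []  # records every (category, name) pair the inner check receives


def check_for_package(package_name: str, package_category: str) -> bool:
    """Uncached stand-in for check_for_nltk_package."""
    calls.append((package_category, package_name))
    return True  # pretend the package is already installed


@lru_cache()
def download_packages_if_not_present() -> None:
    """Stand-in for _download_nltk_packages_if_not_present; the body runs once."""
    tagger_available = check_for_package("averaged_perceptron_tagger_eng", "taggers")
    tokenizer_available = check_for_package("punkt_tab", "tokenizers")
    if not (tagger_available and tokenizer_available):
        pass  # the download step would go here


# Call the cached outer function several times; the inner check
# still runs only once per distinct argument pair.
for _ in range(3):
    download_packages_if_not_present()
```

This holds only as long as the inner check has no other callers; any uncached call site added later would pay the lookup cost on every call.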

Closing for now as unnecessary, but happy to reconsider if you think I've missed something.

Dec 19 '24 18:12 scanny