haystack icon indicating copy to clipboard operation
haystack copied to clipboard

Upgrade Haystack 1.x to NLTK 3.9

Open julian-risch opened this issue 1 year ago • 1 comments

In Haystack 1.26.x we should replace the nltk.download("punkt") with nltk.download('punkt_tab') here https://github.com/deepset-ai/haystack/blob/883cd466bd0108ff4f6af4c389f0e42fabc1282c/haystack/nodes/preprocessor/preprocessor.py#L123 so that users can use Haystack 1.26.x with NLTK 3.9. Prior NLTK versions are affected by https://nvd.nist.gov/vuln/detail/CVE-2024-39705. We should therefore also pin NLTK to >=3.9.

While the NLTK release notes list 3.8.2 https://pypi.org/project/nltk/#history with the fix, that release disappeared from pypi. https://pypi.org/project/nltk/#history There is a comment on GitHub saying that the release was deleted and there will be a 3.9 https://github.com/nltk/nltk/issues/3301#issuecomment-2290843549

julian-risch avatar Aug 15 '24 13:08 julian-risch

Our pre-processing pipeline broke since the nltk update because of the following issues:

  1. Unpickling restricted to simple types to avoid CVE issues (the reason we have to upgrade from 3.8.1 asap). So use "PunktTokenizer" to safely load the nltk model rather than using "nltk.data.load".
  2. "punkt" package contains unsafe pickles and is replaced by "punkt_tab" [https://github.com/nltk/nltk/issues/3293]

We had to create a custom preprocessor component to fix the above issues. Happy to contribute to a fix if needed.

sagarneeldubey avatar Aug 15 '24 18:08 sagarneeldubey

Closing as https://github.com/deepset-ai/haystack/pull/8256 has been integrated on 1.26.x branch and a new 1.26.3 release has been released with this fix.

vblagoje avatar Aug 29 '24 14:08 vblagoje

We upgraded farm-haystack to 1.26.3 today and we confirm that the preprocessor is working fine with nltk 3.9.1 , so we don't need the custom preprocessor anymore. Thanks a lot for your prompt response to this issue!

sagarneeldubey avatar Aug 30 '24 20:08 sagarneeldubey

Awesome, thanks for reporting back @sagarneeldubey much appreciated 🚀

vblagoje avatar Sep 01 '24 17:09 vblagoje