Upgrade Haystack 1.x to NLTK 3.9
In Haystack 1.26.x we should replace the nltk.download("punkt") with nltk.download('punkt_tab') here https://github.com/deepset-ai/haystack/blob/883cd466bd0108ff4f6af4c389f0e42fabc1282c/haystack/nodes/preprocessor/preprocessor.py#L123
so that users can use Haystack 1.26.x with NLTK 3.9. Prior NLTK versions are affected by https://nvd.nist.gov/vuln/detail/CVE-2024-39705. We should therefore also pin NLTK to >=3.9.
While the NLTK release notes list 3.8.2 https://pypi.org/project/nltk/#history with the fix, that release disappeared from pypi. https://pypi.org/project/nltk/#history There is a comment on GitHub saying that the release was deleted and there will be a 3.9 https://github.com/nltk/nltk/issues/3301#issuecomment-2290843549
Our pre-processing pipeline broke since the nltk update because of the following issues:
- Unpickling restricted to simple types to avoid CVE issues (the reason we have to upgrade from 3.8.1 asap). So use "PunktTokenizer" to safely load the nltk model rather than using "nltk.data.load".
- "punkt" package contains unsafe pickles and is replaced by "punkt_tab" [https://github.com/nltk/nltk/issues/3293]
We had to create a custom preprocessor component to fix the above issues. Happy to contribute to a fix if needed.
Closing as https://github.com/deepset-ai/haystack/pull/8256 has been integrated on 1.26.x branch and a new 1.26.3 release has been released with this fix.
We upgraded farm-haystack to 1.26.3 today and we confirm that the preprocessor is working fine with nltk 3.9.1 , so we don't need the custom preprocessor anymore. Thanks a lot for your prompt response to this issue!
Awesome, thanks for reporting back @sagarneeldubey much appreciated 🚀