Unnecessary dependency on FuzzyTM pulls in many libraries
Problem description
I'm trying to upgrade to the new Gensim 4.3.0 release. My colleague @juhoinkinen noticed in https://github.com/NatLibFi/Annif/pull/660 that Gensim 4.3.0 pulls in more dependencies than the previous release 4.2.0, including pandas. I suspect that at least the FuzzyTM dependency (which in turn pulls in pandas) is actually unused and thus unnecessary.
Steps/code/corpus to reproduce
Installing Gensim 4.2.0 into an empty venv (only four packages installed):
$ pip install gensim==4.2.0
Collecting gensim==4.2.0
Downloading gensim-4.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.0/24.0 MB 2.0 MB/s eta 0:00:00
Collecting scipy>=0.18.1
Downloading scipy-1.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.4/34.4 MB 3.3 MB/s eta 0:00:00
Collecting numpy>=1.17.0
Downloading numpy-1.24.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.3/17.3 MB 10.6 MB/s eta 0:00:00
Collecting smart-open>=1.8.1
Downloading smart_open-6.3.0-py3-none-any.whl (56 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.8/56.8 KB 9.7 MB/s eta 0:00:00
Installing collected packages: smart-open, numpy, scipy, gensim
Successfully installed gensim-4.2.0 numpy-1.24.1 scipy-1.10.0 smart-open-6.3.0
Installing Gensim 4.3.0 into an empty venv (18 packages installed):
$ pip install gensim==4.3.0
Collecting gensim==4.3.0
Downloading gensim-4.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.1/24.1 MB 6.9 MB/s eta 0:00:00
[...skipping downloads...]
Installing collected packages: pytz, urllib3, smart-open, six, numpy, idna, charset-normalizer, certifi, scipy, requests, python-dateutil, simpful, pandas, miniful, fst-pso, pyfume, FuzzyTM, gensim
Running setup.py install for miniful ... done
Running setup.py install for fst-pso ... done
Successfully installed FuzzyTM-2.0.5 certifi-2022.12.7 charset-normalizer-2.1.1 fst-pso-1.8.1 gensim-4.3.0 idna-3.4 miniful-0.0.6 numpy-1.24.1 pandas-1.5.2 pyfume-0.2.25 python-dateutil-2.8.2 pytz-2022.7 requests-2.28.1 scipy-1.10.0 simpful-2.9.0 six-1.16.0 smart-open-6.3.0 urllib3-1.26.13
The size of the venv has grown from 249MB to 318MB, an increase of 69MB.
Here is what pipdeptree shows. FuzzyTM appears to be the main reason why so many libraries are pulled in:
gensim==4.3.0
  - FuzzyTM [required: >=0.4.0, installed: 2.0.5]
    - numpy [required: Any, installed: 1.24.1]
    - pandas [required: Any, installed: 1.5.2]
      - numpy [required: >=1.21.0, installed: 1.24.1]
      - python-dateutil [required: >=2.8.1, installed: 2.8.2]
        - six [required: >=1.5, installed: 1.16.0]
      - pytz [required: >=2020.1, installed: 2022.7]
    - pyfume [required: Any, installed: 0.2.25]
      - fst-pso [required: Any, installed: 1.8.1]
        - miniful [required: Any, installed: 0.0.6]
          - numpy [required: >=1.12.0, installed: 1.24.1]
          - scipy [required: >=1.0.0, installed: 1.10.0]
            - numpy [required: >=1.19.5,<1.27.0, installed: 1.24.1]
        - numpy [required: Any, installed: 1.24.1]
      - numpy [required: Any, installed: 1.24.1]
      - scipy [required: Any, installed: 1.10.0]
        - numpy [required: >=1.19.5,<1.27.0, installed: 1.24.1]
      - simpful [required: Any, installed: 2.9.0]
        - numpy [required: >=1.12.0, installed: 1.24.1]
        - requests [required: Any, installed: 2.28.1]
          - certifi [required: >=2017.4.17, installed: 2022.12.7]
          - charset-normalizer [required: >=2,<3, installed: 2.1.1]
          - idna [required: >=2.5,<4, installed: 3.4]
          - urllib3 [required: >=1.21.1,<1.27, installed: 1.26.13]
        - scipy [required: >=1.0.0, installed: 1.10.0]
          - numpy [required: >=1.19.5,<1.27.0, installed: 1.24.1]
    - scipy [required: Any, installed: 1.10.0]
      - numpy [required: >=1.19.5,<1.27.0, installed: 1.24.1]
  - numpy [required: >=1.18.5, installed: 1.24.1]
  - scipy [required: >=1.7.0, installed: 1.10.0]
    - numpy [required: >=1.19.5,<1.27.0, installed: 1.24.1]
  - smart-open [required: >=1.8.1, installed: 6.3.0]
pip==22.0.2
pipdeptree==2.3.3
setuptools==59.6.0
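For anyone who wants to reproduce the dependency listing without pipdeptree, here is a rough sketch using only the standard library. It is an illustration, not how pipdeptree works internally: the helper names (`requirement_name`, `normalize`, `installed_requires`, `transitive_deps`) are made up for this example, the requirement parsing is a simple regex rather than a full PEP 508 parser, and environment markers other than `extra` are ignored.

```python
import re
from importlib import metadata


def requirement_name(req):
    """Extract the bare project name from a string such as 'numpy>=1.18.5'."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", req)
    if match is None:
        raise ValueError(f"cannot parse requirement: {req!r}")
    return match.group(1)


def normalize(name):
    """Normalize a project name so 'Fst_PSO' and 'fst-pso' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def installed_requires(name):
    """Declared requirements of an installed distribution.

    Requirements that only apply to optional extras are skipped
    (a simplification: other environment markers are not evaluated).
    """
    try:
        reqs = metadata.requires(name) or []
    except metadata.PackageNotFoundError:
        return []
    return [r for r in reqs if not re.search(r"extra\s*==", r)]


def transitive_deps(root, requires=installed_requires):
    """Collect the names of everything `root` pulls in, directly or indirectly."""
    seen = set()
    stack = [root]
    while stack:
        current = stack.pop()
        for req in requires(current):
            dep = normalize(requirement_name(req))
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


if __name__ == "__main__":
    # Prints an empty list if gensim is not installed in this environment.
    print(sorted(transitive_deps("gensim")))
```

Run inside each of the two venvs, this should make the difference between the 4.2.0 and 4.3.0 dependency sets easy to diff.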
It appears that the FuzzyTM dependency was added in PR #3398 (Flsamodel) by @ERijck. The first commits in this PR depended on the library, but a subsequent commit, 9fec00b32d281e795f3b4701bf11fa1c97780227, reworked the code so that it doesn't need to import FuzzyTM at all. However, the dependency in setup.py was never removed; it's still there: https://github.com/RaRe-Technologies/gensim/blob/f35faae7a7b0c3c8586fb61208560522e37e0e7e/setup.py#L347
I think the FuzzyTM dependency could be safely dropped, as the library is not actually imported. It would reduce the number of libraries Gensim pulls in and thus reduce the size of installations, including Docker images where minimal size is often required.
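The claim that the library is never imported can be checked mechanically. Below is a minimal sketch that parses every `.py` file under a directory and reports those with a static import of a given top-level module. The function names are hypothetical, and the check has known blind spots: it does not see dynamic imports (for example via `importlib.import_module`) and it skips non-Python sources such as Cython `.pyx` files.

```python
import ast
from pathlib import Path


def imported_top_level_modules(source):
    """Return the top-level module names imported by a piece of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 means a relative import, which is never a third-party package.
            if node.level == 0 and node.module:
                found.add(node.module.split(".")[0])
    return found


def files_importing(package_dir, module_name):
    """List the .py files under `package_dir` that statically import `module_name`."""
    hits = []
    for path in sorted(Path(package_dir).rglob("*.py")):
        try:
            modules = imported_top_level_modules(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be read or parsed
        if module_name in modules:
            hits.append(path)
    return hits
```

For example, calling `files_importing("gensim", "FuzzyTM")` from the root of a Gensim checkout would be expected to return an empty list if the dependency really is unused.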
Versions
I'm using Ubuntu Linux 22.04.
Linux-5.15.0-56-generic-x86_64-with-glibc2.35
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
Bits 64
NumPy 1.24.1
SciPy 1.10.0
gensim 4.3.0
FAST_VERSION 0
Thanks for reporting!
@mpenkov Is fuzzyTM really a hard dependency? If so that's terrible, definitely an omission / bug (or if intentional, done in very bad taste). Let's release a bug fix ASAP.
I tracked the change in setup.py down to https://github.com/RaRe-Technologies/gensim/pull/3398. @ERijck why do you think this was needed, why did you add that line?
I'm surprised this line is still there; it was part of my first PR. The dependency can be removed from setup.py.