gensim
gensim copied to clipboard
don't install *.c *.cpp *.pxd *.pyx files
Problem description
I am installing gensim with setup.py install. I expected that this would only install needed files. I noticed that this installs files such as *.c *.cpp *.pxd *.pyx that aren't needed when importing and using the gensim Python modules.
Steps/code/corpus to reproduce
These are the files that are installed but should not be. They are source code and are not used by the gensim Python modules, so should not get installed by setup.py install or in the wheels or elsewhere.
gensim/_matutils.c: C source, ASCII text
gensim/_matutils.pyx: a /usr/bin/env cython script, ASCII text executable
gensim/corpora/_mmreader.c: C source, ASCII text
gensim/corpora/_mmreader.pyx: Python script, ASCII text executable
gensim/models/doc2vec_corpusfile.cpp: C source, ASCII text
gensim/models/doc2vec_corpusfile.pyx: a /usr/bin/env cython script, ASCII text executable
gensim/models/doc2vec_inner.cpp: C source, ASCII text
gensim/models/doc2vec_inner.pxd: a /usr/bin/env cython script, ASCII text executable
gensim/models/doc2vec_inner.pyx: a /usr/bin/env cython script, ASCII text executable
gensim/models/fast_line_sentence.h: C++ source, ASCII text
gensim/models/fasttext_corpusfile.cpp: C source, ASCII text
gensim/models/fasttext_corpusfile.pyx: a /usr/bin/env cython script, ASCII text executable
gensim/models/fasttext_inner.c: C source, ASCII text
gensim/models/fasttext_inner.pxd: a /usr/bin/env cython script, ASCII text executable
gensim/models/fasttext_inner.pyx: a /usr/bin/env cython script, ASCII text executable
gensim/models/nmf_pgd.c: C source, ASCII text
gensim/models/nmf_pgd.pyx: Python script, ASCII text executable
gensim/models/stdint_wrapper.h: C source, ASCII text
gensim/models/voidptr.h: C source, ASCII text
gensim/models/word2vec_corpusfile.cpp: C source, ASCII text
gensim/models/word2vec_corpusfile.pxd: ASCII text
gensim/models/word2vec_corpusfile.pyx: a /usr/bin/env cython script, ASCII text executable
gensim/models/word2vec_inner.c: C source, ASCII text
gensim/models/word2vec_inner.pxd: ASCII text
gensim/models/word2vec_inner.pyx: a /usr/bin/env cython script, ASCII text executable
gensim/similarities/fastss.c: C source, ASCII text
These files are also in the official wheels gensim distributes on PyPI:
$ wget https://files.pythonhosted.org/packages/06/66/e875156aca2edf0416a8739894dc97b05429ebfa4ada934774361fbf25c7/gensim-4.1.2-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
$ unzip -l gensim-4.1.2-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl | grep -v test_data | grep -v \\.py$ | grep -v cpython-39 | grep -v dist-info
Archive: gensim-4.1.2-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Length Date Time Name
--------- ---------- ----- ----
1042135 2021-09-16 23:04 gensim/_matutils.c
8966 2021-09-16 23:03 gensim/_matutils.pyx
10796 2021-09-16 23:03 gensim/models/fasttext_corpusfile.pyx
527 2021-09-16 23:03 gensim/models/stdint_wrapper.h
31383 2021-09-16 23:03 gensim/models/doc2vec_inner.pyx
448071 2021-09-16 23:04 gensim/models/doc2vec_corpusfile.cpp
542851 2021-09-16 23:04 gensim/models/fasttext_inner.c
334069 2021-09-16 23:04 gensim/models/fasttext_corpusfile.cpp
310 2021-09-16 23:03 gensim/models/voidptr.h
38384 2021-09-16 23:03 gensim/models/word2vec_inner.pyx
3621 2021-09-16 23:03 gensim/models/doc2vec_inner.pxd
780658 2021-09-16 23:04 gensim/models/nmf_pgd.c
1842 2021-09-16 23:03 gensim/models/nmf_pgd.pyx
17740 2021-09-16 23:03 gensim/models/word2vec_corpusfile.pyx
24392 2021-09-16 23:03 gensim/models/fasttext_inner.pyx
590839 2021-09-16 23:04 gensim/models/doc2vec_inner.cpp
2166 2021-09-16 23:03 gensim/models/word2vec_corpusfile.pxd
24883 2021-09-16 23:03 gensim/models/doc2vec_corpusfile.pyx
606169 2021-09-16 23:04 gensim/models/word2vec_inner.c
1200 2021-09-16 23:03 gensim/models/fast_line_sentence.h
5310 2021-09-16 23:03 gensim/models/word2vec_inner.pxd
630097 2021-09-16 23:04 gensim/models/word2vec_corpusfile.cpp
4873 2021-09-16 23:03 gensim/models/fasttext_inner.pxd
7365 2021-09-16 23:03 gensim/corpora/_mmreader.pyx
448463 2021-09-16 23:04 gensim/corpora/_mmreader.c
335535 2021-09-16 23:04 gensim/similarities/fastss.c
Versions
Linux-5.16.0-2-amd64-x86_64-with-glibc2.33
Python 3.9.10 (main, Feb 22 2022, 13:54:07)
[GCC 11.2.0]
Bits 64
NumPy 1.21.5
SciPy 1.7.3
gensim 4.1.2
FAST_VERSION 1
Woops, forgot to include the *.pyx files. Description updated.
Thanks! Yeah I don't think those files are needed for anything, once Gensim has been installed.
A fix should be as easy as excluding those files in MANIFEST, right? Or what else is needed, could you open a PR please?
I think MANIFEST is for source distributions (such as the git repository or the PyPI source tarball), not binary ones like wheels.
The .c/.cpp files are generated files so they shouldn't really be in git (which they aren't) nor in any other source distributions (such as the PyPI source tarball) and thus should not be in MANIFEST, but the *.pxd *.pyx files are source files not generated files, so they definitely should be in MANIFEST.
I think the place to fix the results of setup.py install and the contents of the wheels is probably in setup.py, but I don't know enough about how that does Python extensions stuff to say what should change.
@mpenkov WDYT? Not urgent (low impact) so removing the Next Release flag. But if trivial, we might as well clean up these files too.
These files are necessary for the local-build (including native compilation) that must happen as part of a local setup.py install installation from a source directory. And, it's quite natural to place them alongside the .py files during that compilation step, and not to auto-delete them unless the user requests it (as via some setup.py clean subcommand). So the fact that they live in the source-checkout directory seems proper.
And, even when a package has compiled elements, it's little cost, & sometimes a benefit, to have the related source-code alongside the more-obfuscated compiled code.
From a quick glance at numpy's MANIFEST.in, they seem to include some .pyx & .c files.
So what's the harm in retaining these?
Good question – what's the motivation here @pabs3 ? Freeing up disk space? Or do these files make some local "modify-and-recompile" workflow harder?
The motivation is simply that they are not used at runtime so logically they should not be present in the binary wheels, nor in the directory that setup.py install puts files in.
The impact is very minor, just unused files downloaded in binary packages and unused files stored on disk.
The files are needed at setup.py build time but not at setup.py install time.
The removal of the .c .cpp files only at setup.py clean time is correct of course.
The location of the .c .cpp files during setup.py build should be in the build/ directory, since they are generated from the .pyx .pxd files.
The .c .cpp files should not be in the file produced by setup.py sdist, since they are generated from the .pyx .pxd files, so they shouldn't be in the MANIFEST.in file.
The only time the .c .cpp files would be useful would be when a gensim developer needs to debug an issue, but in that situation the source tree would almost always be available locally including the .c .cpp files.
Feel free to close the issue wontfix if you prefer to keep the files.