gensim icon indicating copy to clipboard operation
gensim copied to clipboard

don't install *.c *.cpp *.pxd *.pyx files

Open pabs3 opened this issue 3 years ago • 7 comments

Problem description

I am installing gensim with setup.py install. I expected that this would only install needed files. I noticed that this installs files such as *.c *.cpp *.pxd *.pyx that aren't needed when importing and using the gensim Python modules.

Steps/code/corpus to reproduce

These are the files that are installed but should not be. They are source code and are not used by the gensim Python modules, so should not get installed by setup.py install or in the wheels or elsewhere.

gensim/_matutils.c:                         C source, ASCII text
gensim/_matutils.pyx:                       a /usr/bin/env cython script, ASCII text executable
gensim/corpora/_mmreader.c:                 C source, ASCII text
gensim/corpora/_mmreader.pyx:               Python script, ASCII text executable
gensim/models/doc2vec_corpusfile.cpp:       C source, ASCII text
gensim/models/doc2vec_corpusfile.pyx:       a /usr/bin/env cython script, ASCII text executable
gensim/models/doc2vec_inner.cpp:            C source, ASCII text
gensim/models/doc2vec_inner.pxd:            a /usr/bin/env cython script, ASCII text executable
gensim/models/doc2vec_inner.pyx:            a /usr/bin/env cython script, ASCII text executable
gensim/models/fast_line_sentence.h:         C++ source, ASCII text
gensim/models/fasttext_corpusfile.cpp:      C source, ASCII text
gensim/models/fasttext_corpusfile.pyx:      a /usr/bin/env cython script, ASCII text executable
gensim/models/fasttext_inner.c:             C source, ASCII text
gensim/models/fasttext_inner.pxd:           a /usr/bin/env cython script, ASCII text executable
gensim/models/fasttext_inner.pyx:           a /usr/bin/env cython script, ASCII text executable
gensim/models/nmf_pgd.c:                    C source, ASCII text
gensim/models/nmf_pgd.pyx:                  Python script, ASCII text executable
gensim/models/stdint_wrapper.h:             C source, ASCII text
gensim/models/voidptr.h:                    C source, ASCII text
gensim/models/word2vec_corpusfile.cpp:      C source, ASCII text
gensim/models/word2vec_corpusfile.pxd:      ASCII text
gensim/models/word2vec_corpusfile.pyx:      a /usr/bin/env cython script, ASCII text executable
gensim/models/word2vec_inner.c:             C source, ASCII text
gensim/models/word2vec_inner.pxd:           ASCII text
gensim/models/word2vec_inner.pyx:           a /usr/bin/env cython script, ASCII text executable
gensim/similarities/fastss.c:               C source, ASCII text

These files are also in the official wheels gensim distributes on PyPI:

$ wget https://files.pythonhosted.org/packages/06/66/e875156aca2edf0416a8739894dc97b05429ebfa4ada934774361fbf25c7/gensim-4.1.2-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
$ unzip -l gensim-4.1.2-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl  | grep -v test_data | grep -v \\.py$ | grep -v cpython-39 | grep -v dist-info
Archive:  gensim-4.1.2-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
  1042135  2021-09-16 23:04   gensim/_matutils.c
     8966  2021-09-16 23:03   gensim/_matutils.pyx
    10796  2021-09-16 23:03   gensim/models/fasttext_corpusfile.pyx
      527  2021-09-16 23:03   gensim/models/stdint_wrapper.h
    31383  2021-09-16 23:03   gensim/models/doc2vec_inner.pyx
   448071  2021-09-16 23:04   gensim/models/doc2vec_corpusfile.cpp
   542851  2021-09-16 23:04   gensim/models/fasttext_inner.c
   334069  2021-09-16 23:04   gensim/models/fasttext_corpusfile.cpp
      310  2021-09-16 23:03   gensim/models/voidptr.h
    38384  2021-09-16 23:03   gensim/models/word2vec_inner.pyx
     3621  2021-09-16 23:03   gensim/models/doc2vec_inner.pxd
   780658  2021-09-16 23:04   gensim/models/nmf_pgd.c
     1842  2021-09-16 23:03   gensim/models/nmf_pgd.pyx
    17740  2021-09-16 23:03   gensim/models/word2vec_corpusfile.pyx
    24392  2021-09-16 23:03   gensim/models/fasttext_inner.pyx
   590839  2021-09-16 23:04   gensim/models/doc2vec_inner.cpp
     2166  2021-09-16 23:03   gensim/models/word2vec_corpusfile.pxd
    24883  2021-09-16 23:03   gensim/models/doc2vec_corpusfile.pyx
   606169  2021-09-16 23:04   gensim/models/word2vec_inner.c
     1200  2021-09-16 23:03   gensim/models/fast_line_sentence.h
     5310  2021-09-16 23:03   gensim/models/word2vec_inner.pxd
   630097  2021-09-16 23:04   gensim/models/word2vec_corpusfile.cpp
     4873  2021-09-16 23:03   gensim/models/fasttext_inner.pxd
     7365  2021-09-16 23:03   gensim/corpora/_mmreader.pyx
   448463  2021-09-16 23:04   gensim/corpora/_mmreader.c
   335535  2021-09-16 23:04   gensim/similarities/fastss.c

Versions

Linux-5.16.0-2-amd64-x86_64-with-glibc2.33
Python 3.9.10 (main, Feb 22 2022, 13:54:07) 
[GCC 11.2.0]
Bits 64
NumPy 1.21.5
SciPy 1.7.3
gensim 4.1.2
FAST_VERSION 1

pabs3 avatar Feb 28 '22 08:02 pabs3

Woops, forgot to include the *.pyx files. Description updated.

pabs3 avatar Feb 28 '22 08:02 pabs3

Thanks! Yeah I don't think those files are needed for anything, once Gensim has been installed.

A fix should be as easy as excluding those files in MANIFEST, right? Or what else is needed, could you open a PR please?

piskvorky avatar Feb 28 '22 10:02 piskvorky

I think MANIFEST is for source distributions (such as the git repository or the PyPI source tarball), not binary ones like wheels.

The .c/.cpp files are generated files so they shouldn't really be in git (which they aren't) nor in any other source distributions (such as the PyPI source tarball) and thus should not be in MANIFEST, but the *.pxd *.pyx files are source files not generated files, so they definitely should be in MANIFEST.

I think the place to fix the results of setup.py install and the contents of the wheels is probably in setup.py, but I don't know enough about how that does Python extensions stuff to say what should change.

pabs3 avatar Feb 28 '22 10:02 pabs3

@mpenkov WDYT? Not urgent (low impact) so removing the Next Release flag. But if trivial, we might as well clean up these files too.

piskvorky avatar Feb 28 '22 11:02 piskvorky

These files are necessary for the local-build (including native compilation) that must happen as part of a local setup.py install installation from a source directory. And, it's quite natural to place them alongside the .py files during that compilation step, and not to auto-delete them unless the user requests it (as via some setup.py clean subcommand). So the fact that they live in the source-checkout directory seems proper.

And, even when a package has compiled elements, it's little cost, & sometimes a benefit, to have the related source-code alongside the more-obfuscated compiled code.

From a quick glance at numpy's MANIFEST.in, they seem to include some .pyx & .c files.

So what's the harm in retaining these?

gojomo avatar Mar 16 '22 21:03 gojomo

Good question – what's the motivation here @pabs3 ? Freeing up disk space? Or do these files make some local "modify-and-recompile" workflow harder?

piskvorky avatar Mar 16 '22 22:03 piskvorky

The motivation is simply that they are not used at runtime so logically they should not be present in the binary wheels, nor in the directory that setup.py install puts files in.

The impact is very minor, just unused files downloaded in binary packages and unused files stored on disk.

The files are needed at setup.py build time but not at setup.py install time.

The removal of the .c .cpp files only at setup.py clean time is correct of course.

The location of the .c .cpp files during setup.py build should be in the build/ directory, since they are generated from the .pyx .pxd files.

The .c .cpp files should not be in the file produced by setup.py sdist, since they are generated from the .pyx .pxd files, so they shouldn't be in the MANIFEST.in file.

The only time the .c .cpp files would be useful would be when a gensim developer needs to debug an issue, but in that situation the source tree would almost always be available locally including the .c .cpp files.

Feel free to close the issue wontfix if you prefer to keep the files.

pabs3 avatar Apr 02 '22 06:04 pabs3