No single, complete solution for packaging data
I'd like to raise an issue which relates to #34, but is wider. It's not clear to me what the best practice is regarding including data (templates, config files, other resources) as part of a package.
There appear to be three options:
- a "MANIFEST.in" file and include_package_data=True in "setup.py"
- package_data dict in "setup.py"
- data_files parameter in "setup.py"
Each of these options covers overlapping requirements, but it seems that none of them can cover all likely requirements - or it's not obvious which is the right one to use.
The package_data option seems the cleanest implementation to me, from the perspective of someone trying to package a project, if only it worked predictably every time.
I have seen advice which suggests using "MANIFEST.in", but this solution seems awkward: it requires a context-switch to a different file with its own configuration format, and it is harder to generate programmatically. Maybe I've just not understood it, but documentation on this is difficult to find, and I've never seen advice which actually explains why this is best practice.
This is confusing, but the answer to how you do this today is that, no matter what, you need a MANIFEST.in. That is because the MANIFEST.in tells the toolchain what files to include inside of the .tar.gz when you run python setup.py sdist, and you (obviously) can't install a file that isn't included inside of the .tar.gz.
Once you have a file inside of the .tar.gz, you then need to tell setuptools what files you want to install, and where you want to install them. You have two options here: package_data or data_files. Between these two I suggest package_data because it should basically always work the same way, whereas data_files has caveats.
Finally, the include_package_data=True is a shortcut that will automatically populate the package_data dictionary based on your MANIFEST.in.
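A minimal sketch of the arrangement described above, assuming a hypothetical package named "mypkg" with HTML templates under "mypkg/templates/" (both names are invented for illustration). The MANIFEST.in line is held in a string only so that everything fits in one Python snippet; in a real project it would live in its own file next to setup.py.

```python
# Illustrative only: "mypkg" and the "templates/*.html" pattern are hypothetical.

# What MANIFEST.in might contain, so the files end up in the sdist (.tar.gz).
MANIFEST_IN = "recursive-include mypkg/templates *.html\n"

# Keyword arguments for setuptools.setup() in setup.py.
SETUP_KWARGS = {
    "name": "mypkg",
    "version": "0.1",
    "packages": ["mypkg"],
    # Install these files alongside the package's Python modules.
    # Patterns are relative to the package directory.
    "package_data": {"mypkg": ["templates/*.html"]},
    # Alternative described above: drop package_data and set
    # "include_package_data": True to rely on MANIFEST.in instead.
}


def main():
    """Call this at the bottom of a real setup.py."""
    # Imported here so the sketch can be loaded without setuptools installed.
    from setuptools import setup

    setup(**SETUP_KWARGS)
```

The explicit package_data entry is used here because it is the option recommended above; whether include_package_data suits a given project better is a judgment call.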
Thanks for that clarification - much appreciated!
If you always need a MANIFEST.in, why even have to set include_package_data=True? (I just got burned by this recently, as the extra files showed up in the source dist but not the wheel.)
And maybe to answer my own question, per @dstufft's comment above: package_data and data_files also tell setuptools where those files need to be installed on the target system, not just what needs to be included? Is this correct?
Thanks Donald, I never understood what include_package_data was supposed to do! Populating a setup argument from manifest seems backward though.
For reference, in 2.7 and 3.x distutils includes these files (if present) by default:
- README or README.txt
- setup.py
- test/test*.py
- all pure Python modules mentioned in setup script
- all files pointed to by package_data (build_py)
- all files defined in data_files
- all files defined as scripts
- all C sources listed as part of extensions or C libraries in the setup script (doesn't catch C headers!)
drafter250: in a sense, they do:
- package_data will cause the files to be included alongside the Python files, in the same location as in the source tree, which is why code using __file__ + os.path.join will work during development and after installation, even though using pkgutil.get_data or the pkg_resources helpers is cleaner and handles more cases, such as zipped installations.
- data_files contains the desired destination for the files listed, which is not cross-platform or cross-OS (think /usr/share/doc on Debian vs. something else on BSD vs. a virtualenv install vs. Windows), which is why it is practically unusable for general-consumption libs.
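To make the two access styles mentioned above concrete, here is a small self-contained sketch. It builds a throwaway package in a temporary directory (the names "demo_pkg" and "data/greeting.txt" are hypothetical) and reads the same file once via __file__ + os.path.join and once via pkgutil.get_data.

```python
import importlib
import os
import pkgutil
import sys
import tempfile

# Build a throwaway package on disk so the example is self-contained.
# "demo_pkg" and "data/greeting.txt" are hypothetical names.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg")
os.makedirs(os.path.join(pkg_dir, "data"))
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("")
with open(os.path.join(pkg_dir, "data", "greeting.txt"), "w") as f:
    f.write("hello")

# Make the temporary directory importable.
sys.path.insert(0, root)
importlib.invalidate_caches()

import demo_pkg  # noqa: E402

# Approach 1: build a filesystem path from the package's __file__.
# This relies on the package being installed as plain files on disk.
path = os.path.join(os.path.dirname(demo_pkg.__file__), "data", "greeting.txt")
with open(path) as f:
    via_file = f.read()

# Approach 2: ask the package's loader for the bytes.
# The resource name is relative to the package and uses "/" separators.
via_pkgutil = pkgutil.get_data("demo_pkg", "data/greeting.txt").decode("utf-8")
```

Both variables end up holding the file's text; the second approach does not assume the package lives in an ordinary directory.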
Is using a MANIFEST.IN file with "include_package_data" still best practice today? I'm trying to package a handful of .csv files using setuptools and have not had any success.
I've tried:
- Using a MANIFEST.IN file coupled with "include_package_data=True"
- Not using a MANIFEST.IN file, but using "package_data={'': ['data/*.csv']}"
- Using include_package_data=True, package_data, and attempting pkg_resources as in this guide. Though I've also read that you should never use "include_package_data=True" alongside package_data.
My repository is set up with this hierarchy:
+--build
+--dist
+--package
   +--data
      +--file1.csv
      +--file2.csv
      +--file3.csv
   +--__init__.py
   +--fileA.py
   +--fileB.py
   +--fileC.py
+--package.egg-info
+--tests
+--LICENSE
+--README.md
+--MANIFEST.IN (with and without)
+--setup.py
where the setup file is written in the following manner:
setuptools.setup(
    ....
    packages=setuptools.find_packages(),
    include_package_data=True,
    package_data={'': ['data/*.csv']},
Hoping maybe the python community has settled on a best answer here!
EDIT: Here are a few more possible answers - still confused though
package_data={'': ['data/*.csv']}
keys in that dict need to be your package names:
package_data={"package": ["data/*.csv"]}
I would try that with a single file inside package first, to see if it gets picked up in the sdist and wheel. Then add the complications of sub-directory and glob pattern.
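One way to run that check is to list what actually ended up in each built archive. The helpers below are a rough sketch using only the standard library; the function names are made up for this example.

```python
import tarfile
import zipfile


def sdist_members(path):
    """Return the file names stored in an sdist (.tar.gz)."""
    with tarfile.open(path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


def wheel_members(path):
    """Return the file names stored in a wheel (which is a zip archive)."""
    with zipfile.ZipFile(path) as whl:
        return sorted(n for n in whl.namelist() if not n.endswith("/"))


def contains(members, suffix):
    """True if any archive member name ends with the given path suffix."""
    return any(name.endswith(suffix) for name in members)
```

After building, you could call contains(sdist_members(...), "package/data/file1.csv") and the same with wheel_members(...) on the files in dist/, substituting whatever archive names your build produced. If the two answers differ, the data file is being picked up by one mechanism but not the other.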
I don't understand include_package_data and I am 100% convinced that it's the root of all evil. I have no idea why, but for some reason setting include_package_data to True and then also trying to define package_data without MANIFEST.in makes it so that the files are included in the wheels, but are not included in source dists.
This behaviour is completely ridiculous. And apparently this is a well-known issue. So I'm wondering why it isn't fixed, or documented with giant red blinking warning signs all over the place.