Executable cannot start without PythonPackagingPolicy.include_distribution_sources = True
I'm using PyOxidizer v0.9.0 on macOS.
Reproducing the actual bug
Place the following in pyoxidizer.bzl in an otherwise empty directory and execute pyoxidizer run.
def make():
    dist = default_python_distribution()
    python_config = dist.make_python_interpreter_config()
    python_config.config_profile = "python"
    python_config.run_mode = "eval:import sys; from pprint import pprint; pprint(sys.argv)"
    policy = dist.make_python_packaging_policy()
    policy.set_resource_handling_mode("classify")
    policy.include_distribution_sources = False
    policy.bytecode_optimize_level_zero = True
    policy.include_distribution_resources = True
    exe = dist.to_python_executable(
        name="pyapp",
        packaging_policy=policy,
        config=python_config,
    )
    files = FileManifest()
    files.add_python_resource(".", exe)
    return files

register_target("install", make, default=True)
resolve_targets()
It compiles just fine. When it runs, it produces the following error message (paths shortened for clarity; also, despite the output below, my current shell session defines no PYTHON* environment variables):
Python path configuration:
PYTHONHOME = './build/x86_64-apple-darwin/debug/install'
PYTHONPATH = (not set)
program name = './build/x86_64-apple-darwin/debug/install/./pyapp'
isolated = 0
environment = 1
user site = 1
import site = 1
sys._base_executable = './build/x86_64-apple-darwin/debug/install/./pyapp'
sys.base_prefix = './build/x86_64-apple-darwin/debug/install'
sys.base_exec_prefix = './build/x86_64-apple-darwin/debug/install'
sys.executable = './build/x86_64-apple-darwin/debug/install/./pyapp'
sys.prefix = './build/x86_64-apple-darwin/debug/install'
sys.exec_prefix = './build/x86_64-apple-darwin/debug/install'
sys.path = [
'./build/x86_64-apple-darwin/debug/install/lib/python38.zip',
'./build/x86_64-apple-darwin/debug/install/lib/python3.8',
'./build/x86_64-apple-darwin/debug/install/lib/python3.8/lib-dynload',
]
during initializing Python main: init_fs_encoding: failed to get the Python codec of the filesystem encoding
My best guess is that the executable could not import the encodings Python package:
init_fs_encoding calls config_get_codec_name, which calls _PyCodec_Lookup, which calls _PyCodecRegistry_Init, which imports encodings.
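To see the same dependency from a regular CPython interpreter, here is a minimal sketch in plain Python. It is not PyOxidizer-specific, and the helper name filesystem_codec_name is made up for illustration. It looks up the codec for the filesystem encoding, the same kind of lookup init_fs_encoding needs at startup, and shows that the encodings package is loaded.

```python
import codecs
import sys


def filesystem_codec_name():
    """Return the normalized name of the codec used for file system paths."""
    # codecs.lookup goes through the codec registry, whose search function
    # is provided by the pure-Python ``encodings`` package.
    return codecs.lookup(sys.getfilesystemencoding()).name


if __name__ == "__main__":
    print(filesystem_codec_name())
    # On a working interpreter the package is already imported at startup.
    print("encodings" in sys.modules)
```

If the encodings package cannot be found at all, the interpreter never gets far enough to run a script like this, which matches the failure above.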
Variations
The following modifications don't seem to fix the bug.
- Changing the set_resource_handling_mode mode to "files"
- Setting bytecode_optimize_level_zero to False
- Setting include_distribution_resources to False
Expected Behavior
I would have expected
policy.include_distribution_sources = False
policy.bytecode_optimize_level_zero = True
to include the Python distribution pure-Python code as byte code only without the source code. I had hoped this was the case because that would cut down on the size of the executable.
Once that failed, I was hoping that
policy.include_distribution_sources = False
policy.include_distribution_resources = True
would force the inclusion of Python distribution resources where resources refers to PyOxidizer Starlark types (as I explain below, I gather that this is not the case).
Why I came up with this expectation
The documentation for include_distribution_sources and include_distribution_resources is a little vague. I'm going to quote it and some of its context at length because it all affects my understanding.
include_distribution_sources (bool)
Whether to add source code for Python modules in the Python distribution.
Default is True.

include_distribution_resources (bool)
Whether to add Python package resources for Python packages in the Python distribution.
Default is False.

include_file_resources (bool)
Whether File resources have their add_include attribute set to True by default.
Default is False.

include_non_distribution_sources (bool)
Whether to add source code for Python modules not in the Python distribution.
I understood include_distribution_sources and include_non_distribution_sources to have the same behavior but affecting PythonModuleSources from the PythonDistribution versus, say, PythonExecutable.pip_install: If bytecode_optimize_level_zero = True, then the byte code is compiled and included in the binary regardless of whether the source code is also embedded in the binary. Setting include_[non_]distribution_sources to False doesn't preclude the importation of the relevant modules, but does exclude the source code from inclusion in the executable.
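For what it's worth, the idea that bytecode alone is enough matches how stock CPython behaves. The sketch below uses plain Python, not PyOxidizer, and the module name bytecode_only_demo is made up. It compiles a module, deletes the source, and imports the leftover .pyc. It says nothing about how PyOxidizer's own importer stores or finds bytecode.

```python
import importlib
import pathlib
import py_compile
import sys
import tempfile


def import_bytecode_only():
    """Import a module for which only a .pyc file exists and return a value from it."""
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "bytecode_only_demo.py"
        src.write_text("ANSWER = 42\n")
        # Write the bytecode next to the source (not into __pycache__) so the
        # sourceless loader can find it once the .py file is gone.
        py_compile.compile(str(src), cfile=str(src.with_suffix(".pyc")), doraise=True)
        src.unlink()
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("bytecode_only_demo")
            return module.ANSWER
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("bytecode_only_demo", None)


if __name__ == "__main__":
    print(import_bytecode_only())
```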
I understood include_file_resources and include_distribution_resources as being quite different from each other despite the overlapping use of the term resource. The resources that include_file_resources refers to are the PyOxidizer File resource objects, whereas the resources that include_distribution_resources refers to are those readable at runtime with importlib.resources.
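To make the second meaning of resource concrete, here is a small plain-Python sketch (Python 3.9 or later, since it relies on importlib.resources.files). The package name demo_pkg_resources_example and the file greeting.txt are made up for illustration. It creates a throwaway package containing a data file and reads it back the way a library would at runtime.

```python
import importlib
import importlib.resources
import pathlib
import sys
import tempfile

PACKAGE = "demo_pkg_resources_example"  # hypothetical package name


def read_demo_resource():
    """Create a temporary package with a data file and read it via importlib.resources."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = pathlib.Path(tmp) / PACKAGE
        pkg_dir.mkdir()
        (pkg_dir / "__init__.py").write_text("")
        (pkg_dir / "greeting.txt").write_text("hello")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            # files() imports the package and returns a traversable for its contents.
            resource = importlib.resources.files(PACKAGE).joinpath("greeting.txt")
            return resource.read_text()
        finally:
            sys.path.remove(tmp)
            sys.modules.pop(PACKAGE, None)


if __name__ == "__main__":
    print(read_demo_resource())
```

Files like greeting.txt are what I take include_distribution_resources to be about, as opposed to the Starlark File objects.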
Exactly the same issue on Catalina and v0.10.3, packaging a simple hello world with pypika and include_distribution_sources = False.
The author of this library has done an enormous amount of excellent work, but I'm curious why other issues take precedence over this one. While the executable is fast and is indeed just a single file, at just under 100 MB it is quite "undistributive", since that size is somewhere near that of a bundled VM like Go / JVM. I managed to pack the same logic (in Java) into 18 MB using GraalVM and into 8 MB with PyInstaller (yep, the specific GLIBC version is a pain). I have high hopes for PyOxidizer and for the ability to use include_distribution_sources as intended.
@indygreg thanks in advance!
i'm curious, why other issues take precedence over this one
PyOxidizer is a huge project! Surely its maintainer is making it for his own purposes, and he's kind enough to share the results with us 😄
Proposal
@up-to-you I can help brainstorm for and review PRs, but there are other open issues for PyOxidizer that are a higher priority for me, so I don't want to own this one at this time. Below is my main idea of how you or someone else could tackle this. It would involve breaking changes to the API, so we should get @indygreg's input and approval before beginning coding.
- Add a PythonModuleBytecode type to complement PythonModuleSource.
  - Methods such as PythonExecutable.pip_install would include in their returned lists a PythonModuleBytecode instead of a PythonModuleSource when, e.g., installing an sdist that contains .pycs.
  - Add a method with the signature PythonModuleSource.compile(optimization=0) -> PythonModuleBytecode so that callbacks registered with PythonPackagingPolicy.register_resource_callback can force inclusion of the byte code rather than the source code.
  - PythonDistribution.python_resources's results would be (already are?) subject to that callback as well, giving the callback fine-grained control over the standard library as well as PythonExecutable.pip_installed packages.
- Replace PythonPackagingPolicy.include_distribution_sources and PythonPackagingPolicy.include_non_distribution_sources with a new PythonPackagingPolicy.include_sources.
  - If you want a small binary, you set policy.include_sources = False. If you want easy debugging, you set policy.include_sources = True. Niche use cases can use the callback described above for full control.
- Replace PythonPackagingPolicy.include_distribution_resources with a new PythonPackagingPolicy.include_package_resources that applies to both the standard library and PythonExecutable.pip_installed packages.
  - Documentation should link to importlib.resources to clarify that resources here means Python package data files rather than one of the *Resource Starlark types.
- Remove PythonPackagingPolicy.include_classified_resources and recommend instead setting PythonPackagingPolicy.file_scanner_classify_files to False to avoid including classified resources.
  - Again, use a callback for fine-grained control.
- Remove PythonPackagingPolicy.include_file_resources and recommend instead setting PythonPackagingPolicy.file_scanner_emit_files to False to avoid including Files.
  - Again, use a callback for fine-grained control.
- Improve the documentation (and possibly API) for PythonExecutable.filter_from_files. I suspect it offers an even easier approach than a callback for fine-grained control, but I couldn't make heads or tails of how to use it (how do I know what my resources are named?) based on its existing docs.
There would no longer be an automatic way to avoid including the entire standard library, but Python simply doesn't work without some Python modules, such as the encodings package. My proposal would also reduce the size of PythonPackagingPolicy by three fields, which I think would make the API easier to understand.
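To give a feel for the first bullet, here is a toy model in plain Python. These dataclasses are stand-ins that only borrow the names from the proposal. They are not PyOxidizer's Starlark types, and the field names (name, source, optimization, bytecode) are my assumptions.

```python
import marshal
from dataclasses import dataclass


@dataclass
class PythonModuleBytecode:
    """Toy stand-in for the proposed bytecode resource type."""
    name: str
    optimization: int
    bytecode: bytes


@dataclass
class PythonModuleSource:
    """Toy stand-in for a module source resource."""
    name: str
    source: str

    def compile(self, optimization=0):
        # Compile the source at the requested optimization level and
        # serialize the resulting code object so it can be stored without the source.
        code = compile(self.source, self.name, "exec", optimize=optimization)
        return PythonModuleBytecode(self.name, optimization, marshal.dumps(code))


if __name__ == "__main__":
    compiled = PythonModuleSource("demo", "X = 1\n").compile(optimization=2)
    namespace = {}
    exec(marshal.loads(compiled.bytecode), namespace)
    print(namespace["X"])
```

A resource callback could then swap a source object for the result of compile() instead of toggling several boolean attributes.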
First off, I love this project and it has been a huge benefit to my jc app!
I believe I'm running into this issue as well on PyOxidizer 0.16.0. I'm noticing the binary size is significantly larger now than it was on v0.7.0. My Linux binaries are around 90MB, whereas on 0.7.0 they were around 58MB. macOS binaries are also significantly larger.
Even after compressing, they are really too large to be useful for my use case. I tried policy.include_distribution_sources = False and that did significantly reduce the size of the binary, but it would not run, failing with an issue similar to the one noted above.
I'll probably have to hold off on packaging my app with PyOxidizer 0.16.0 for the moment until this is addressed. Is there any ETA on this?
Thanks!
Having the same problem on Windows 10 on PyOxidizer 0.17.0, so it seems safe to say that it's not platform-dependent and it's still an issue.
@wkschwartz, I agree that some changes to the Starlark API would be welcome. It wasn't until re-reading some sections several times, trying out several combinations in practice, poking around a bit in the source, and a bunch of print logging that I felt I understood enough of how things were working. It mostly makes sense once you get there, but it feels like some polish/rearranging could go a long way.
- Having a more complete example of a resource callback in the docs would be helpful, I think. https://github.com/indygreg/PyOxidizer/issues/303#issuecomment-715992218 explained how the callbacks work, which in turn helped the bigger picture of the build logic click into place.
- Resource Attributes Influencing Adding have some interesting quirks. The source/bytecode distinction really only applies to PythonModuleSource. Maybe it makes sense to have as many attributes the same on different resource types, to avoid attribute errors and require less type(resource), and just no-op on the types for which they don't apply, but it could be documented more clearly: one could wonder if PythonPackageResource.add_source = False is a no-op or effectively a stronger add_include = False. Also, what happens when add_bytecode_optimization_level_zero, add_bytecode_optimization_level_one, and add_bytecode_optimization_level_two are all True? Those probably shouldn't be independent fields.
In case you were still wondering about PythonExecutable.filter_from_files, though I'm 7 months late to the party, it's explained here that it's supposed to consume the output of an instrumented build, taking a sort of "let's exercise the app manually and see which modules are imported in practice" kind of approach to tree-shaking. Although, the TODO in the docs suggests that this might not be fully working currently.
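The "exercise the app and see what gets imported" idea can be sketched in plain Python. This is only an illustration of the concept. It is not how PyOxidizer's instrumented build records modules, and the helper name modules_imported_by is made up.

```python
import importlib
import sys


def modules_imported_by(action):
    """Run ``action`` and return the names of modules it caused to be imported."""
    before = set(sys.modules)
    action()
    return sorted(set(sys.modules) - before)


if __name__ == "__main__":
    # Drop colorsys first so the example reports it even if it was already loaded.
    sys.modules.pop("colorsys", None)
    print(modules_imported_by(lambda: importlib.import_module("colorsys")))
```

A list like this, collected while running the real application, is the kind of input I understand filter_from_files to expect.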
This still seems to be an issue in PyOxidizer 0.19.0, although the exception is now different, at least on my Mac:
installing files to [...]/pyoxidizer-test/./build/x86_64-apple-darwin/debug/install
Python path configuration:
PYTHONHOME = '[...]/pyoxidizer-test/build/x86_64-apple-darwin/debug/install'
PYTHONPATH = (not set)
program name = '[...]/pyoxidizer-test/build/x86_64-apple-darwin/debug/install/pyoxidizer-test'
isolated = 1
environment = 0
user site = 0
import site = 1
sys._base_executable = '[...]/pyoxidizer-test/build/x86_64-apple-darwin/debug/install/pyoxidizer-test'
sys.base_prefix = '[...]/pyoxidizer-test/build/x86_64-apple-darwin/debug/install'
sys.base_exec_prefix = '[...]/pyoxidizer-test/build/x86_64-apple-darwin/debug/install'
sys.platlibdir = 'lib'
sys.executable = '[...]/pyoxidizer-test/build/x86_64-apple-darwin/debug/install/pyoxidizer-test'
sys.prefix = '[...]/pyoxidizer-test/build/x86_64-apple-darwin/debug/install'
sys.exec_prefix = '[...]/pyoxidizer-test/build/x86_64-apple-darwin/debug/install'
sys.path = [
'[...]/build/x86_64-apple-darwin/debug/install/lib/python39.zip',
'[...]/build/x86_64-apple-darwin/debug/install/lib/python3.9',
'[...]/build/x86_64-apple-darwin/debug/install/lib/python3.9/lib-dynload',
]
thread 'main' panicked at 'assertion failed: `(left != right)`
left: `0`,
right: `0`: The Python interpreter is not initalized and the `auto-initialize` feature is not enabled.
Consider calling `pyo3::prepare_freethreaded_python()` before attempting to use Python APIs.', [...]/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.14.5/src/gil.rs:224:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
error: cargo run failed
Is there anything I could do (without significant proficiency in Rust and/or PyOxidizer internals) to help this issue move forward?
I think I figured it out. The problem is that if you set include_distribution_sources to False, this sets the add_include property of all Python module source objects represented by a PythonModuleSource object to False in general, irrespective of whether the module is in the standard library or not. In addition, it also sets add_source to False on all modules in the standard library, but that doesn't help because add_include = False prevents the inclusion of both the source and the bytecode anyway. You basically end up with an empty executable that includes the Python built-in modules only (i.e. those that are not backed by a Python source file in the original Python distribution).
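Here is a toy model of that interaction in plain Python, based only on the behaviour I observed. ModuleFlags and embedded_parts are made-up names, and add_bytecode is a simplification of the three add_bytecode_optimization_level_* attributes.

```python
from dataclasses import dataclass


@dataclass
class ModuleFlags:
    """Simplified stand-in for the add_* attributes of a PythonModuleSource."""
    add_include: bool = True
    add_source: bool = True
    add_bytecode: bool = True  # simplification of the per-level bytecode flags


def embedded_parts(flags):
    """Return which forms of a module end up in the executable."""
    # add_include acts as a master switch: when it is False nothing is added,
    # regardless of the source and bytecode flags.
    if not flags.add_include:
        return set()
    parts = set()
    if flags.add_source:
        parts.add("source")
    if flags.add_bytecode:
        parts.add("bytecode")
    return parts


if __name__ == "__main__":
    # What include_distribution_sources = False appears to produce for stdlib modules.
    print(embedded_parts(ModuleFlags(add_include=False, add_source=False)))
    # What one would expect instead: bytecode only.
    print(embedded_parts(ModuleFlags(add_include=True, add_source=False)))
```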
The workaround that worked for me is as follows:
def resource_callback(policy, resource):
    # Re-enable inclusion of module resources; add_source is left untouched.
    if type(resource) == "PythonModuleSource":
        resource.add_include = True
This resets the add_include property to True for all PythonModuleSource resources. Note that for modules in the standard library, add_source is still False so you end up with the bytecode in the generated executable but not the source. Also note that the snippet above reverts the effect of include_test = False so you end up with all test cases being packaged in the output; I could not find a way to solve this without modifying the Starlark objects and exposing an is_test property on resource that tells whether the resource belongs to a known Python stdlib test package.
Another issue that I've been running into was when I tried to get rid of the docstrings in the generated executable. Level 0 optimization of the bytecode still keeps the docstrings; one needs to disable level 0 and enable level 2 in pyoxidizer.bzl as follows:
policy.bytecode_optimize_level_zero = False
policy.bytecode_optimize_level_two = True
However, one thing that is not documented is that you also need to set python_config.optimization_level to 2, otherwise the oxidized importer will still keep on looking for level-0 optimized bytecode, and since it won't find it (and the sources are also excluded), the executable will fail.
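Both points can be checked with stock CPython. The sketch below uses plain Python, not the oxidized importer, and SOURCE and docstring_at_level are made-up names. It shows that level 2 drops docstrings while level 0 keeps them, and that CPython's file-based import system also keeps bytecode for each optimization level under a separate name, so an interpreter running at one level does not pick up bytecode compiled for another.

```python
import importlib.util

SOURCE = 'def greet():\n    """Say hello."""\n    return "hello"\n'


def docstring_at_level(level):
    """Compile SOURCE at the given optimization level and return greet's docstring."""
    namespace = {}
    exec(compile(SOURCE, "<demo>", "exec", optimize=level), namespace)
    return namespace["greet"].__doc__


if __name__ == "__main__":
    print(docstring_at_level(0))  # docstring is kept
    print(docstring_at_level(2))  # docstring is stripped
    # Bytecode cache names carry the optimization level as a suffix.
    print(importlib.util.cache_from_source("demo.py", optimization=""))
    print(importlib.util.cache_from_source("demo.py", optimization=2))
```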
@indygreg Do you agree that the fact that setting include_distribution_sources to False sets add_include to False on all modules is a bug? If so, would you accept a PR that leaves it at True? (after all, the exclusion of sources is taken care of by setting add_source to False on the standard library modules)
I wanted to summarize some of this a bit since it seems like there are at least partial workarounds, though I still don't see an official solution. By using @ntamas's suggestions and PyOxidizer v0.22.0, I was able to reduce the executable size from over 100MB to 54MB. After compressing the executable, it sits at around 25MB.
Here is the pyoxidizer.bzl I'm using. Are there any more optimizations anyone sees that are missing?
def make_dist():
    return default_python_distribution()

def resource_callback(policy, resource):
    if type(resource) == "PythonModuleSource":
        resource.add_include = True

def make_exe(dist):
    policy = dist.make_python_packaging_policy()
    policy.register_resource_callback(resource_callback)
    policy.extension_module_filter = "no-libraries"  # using 'all' creates an 80MB executable vs. 54MB
    policy.include_distribution_sources = False
    policy.include_non_distribution_sources = False
    policy.bytecode_optimize_level_zero = False
    policy.bytecode_optimize_level_one = False
    policy.bytecode_optimize_level_two = True
    python_config = dist.make_python_interpreter_config()
    python_config.optimization_level = 2
    python_config.run_command = "import jello.cli; jello.cli.main()"
    exe = dist.to_python_executable(
        name="jello",
        packaging_policy=policy,
        config=python_config,
    )
    exe.windows_subsystem = "console"
    exe.add_python_resources(exe.pip_install(["-r", "requirements.txt"]))
    return exe

def make_install(exe):
    files = FileManifest()
    files.add_python_resource(".", exe)
    return files

register_target("dist", make_dist)
register_target("exe", make_exe, depends=["dist"])
register_target("install", make_install, depends=["exe"], default=True)
resolve_targets()
Using PyOxidizer 0.23.0, I again ran into this bug, but @kellyjonbrazil's workaround in https://github.com/indygreg/PyOxidizer/issues/312#issuecomment-1167745889 worked for me if I did not use policy.extension_module_filter = "no-libraries" (my app uses importlib.metadata, which somehow imports _socket).
One thing to know with that workaround is that using the callback to set add_include = True on PythonModuleSource resources overrides setting PythonPackagingPolicy.include_test to False. To compensate, you can manually exclude the test packages. (I have used code similar to this and it works; I wrote the specific code below quickly and off the top of my head, so you may need to fix some bugs for it to work.)
PYTHON_STDLIB_TESTS = (
    "ctypes.test",
    "distutils.tests",
    "idlelib.idle_test",
    "lib2to3.tests",
    "sqlite3.test",
    "test",
    "tkinter.test",
    "unittest.test",
)

def resource_callback(policy, resource):
    if type(resource) == "PythonModuleSource":
        # Include the module by default so its bytecode is embedded...
        resource.add_include = True
        name = resource.name
        # ...but drop anything that lives in a known stdlib test package.
        for package in PYTHON_STDLIB_TESTS:
            if name == package or name.startswith(package + "."):
                resource.add_include = False
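Since the matching logic is ordinary string handling, it can be checked in plain Python before putting it in pyoxidizer.bzl. This is a sketch, and the helper name is_stdlib_test_module is made up. The trailing "." in the prefix test matters: without it, a module such as unittest.testing_helpers (a hypothetical name) would be excluded by the "unittest.test" entry.

```python
PYTHON_STDLIB_TESTS = (
    "ctypes.test",
    "distutils.tests",
    "idlelib.idle_test",
    "lib2to3.tests",
    "sqlite3.test",
    "test",
    "tkinter.test",
    "unittest.test",
)


def is_stdlib_test_module(name, packages=PYTHON_STDLIB_TESTS):
    """Return True if ``name`` is one of the test packages or a module inside one."""
    return any(name == package or name.startswith(package + ".") for package in packages)


if __name__ == "__main__":
    for candidate in ("test", "test.support", "unittest.test.test_case",
                      "unittest.mock", "testing_tools"):
        print(candidate, is_stdlib_test_module(candidate))
```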