PyOxidizer icon indicating copy to clipboard operation
PyOxidizer copied to clipboard

pkgutil.get_data() with in-memory resources

Open dae opened this issue 4 years ago • 5 comments
trafficstars

Third-party libraries like jsonschema, that call pkgutil.get_data() to read their bundled resources, currently fail when the sources are in memory, as they lack a __file__ attribute. PyOxidizer's ResourceReader interface does not require __file__, and can successfully retrieve the resources. By dynamically replacing get_data() with a call to open_resource(), it appears to be possible to get jsonschema to work when in memory (mostly - it still has a dependency on an extension module):

def make_exe():
    dist = default_python_distribution()

    policy = dist.make_python_packaging_policy()
    policy.resources_location = "in-memory"
    policy.resources_location_fallback = "filesystem-relative:lib"

    python_config = dist.make_python_interpreter_config()
    python_config.run_command = """
import sys
import pkgutil
import importlib

def get_data_custom(package, resource):
    try:
        module = importlib.import_module(package)
        with module.__loader__.get_resource_reader(package).open_resource(resource) as f:
            return f.read()
    except:
        return None
if getattr(sys, "oxidized", False):
    pkgutil.get_data = get_data_custom

# quick jsonschema test
schema = {"type" : "object", "properties" : {"price" : {"type" : "number"}} }

import jsonschema
jsonschema.validate(instance=dict(price="test"), schema=schema)
"""

    exe = dist.to_python_executable(
        name = "pyapp",
        packaging_policy = policy,
        config = python_config,
    )
    for resource in exe.pip_install(["jsonschema"]):
        exe.add_python_resource(resource)
    return exe

def make_embedded_resources(exe):
    return exe.to_embedded_resources()

def make_install(exe):
    files = FileManifest()
    files.add_python_resource(".", exe)
    return files

register_target("exe", make_exe)
register_target("resources", make_embedded_resources, depends = ["exe"], default_build_script = True)
register_target("install", make_install, depends = ["exe"], default = True)

resolve_targets()

The sys.oxidized gate is necessary to ensure things run correctly outside PyOxidizer, as the default Python ResourceReader can not handle subdirs in the resource path.

I have not tested this with any other libraries yet, but presume this approach is not limited to jsonschema. It's a hack, and I'm not suggesting this be included in PyOxidizer itself, but I thought it might be worth mentioning in case anyone finds it useful.

Partially related to #436

dae avatar Oct 14 '21 06:10 dae

Here's the implementation of pkgutil.read_data() from the Python stdlib:

def get_data(package, resource):
    spec = importlib.util.find_spec(package)
    if spec is None:
        return None
    loader = spec.loader
    if loader is None or not hasattr(loader, 'get_data'):
        return None
    # XXX needs test
    mod = (sys.modules.get(package) or
           importlib._bootstrap._load(spec))
    if mod is None or not hasattr(mod, '__file__'):
        return None

    # Modify the resource name to be compatible with the loader.get_data
    # signature - an os.path format "filename" starting with the dirname of
    # the package's __file__
    parts = resource.split('/')
    parts.insert(0, os.path.dirname(mod.__file__))
    resource_name = os.path.join(*parts)
    return loader.get_data(resource_name)

That if mod is None or not hasattr(mode, '__file__') check is preventing pkgutil.get_data() from working with PyOxidizer, even though OxidizedFinder implements get_data(). Sadness.

I'm really tempted to give up on __file__ being optional and set it to a virtual path under the current executable's path. This is what zipimport does and as long as a sys.path_hooks entry is registered, things should just work.

indygreg avatar Oct 24 '21 19:10 indygreg

FWIW jsonschema is now using importlib.resources (https://github.com/Julian/jsonschema/pull/873/files) - v4.2.0.

sevein avatar Nov 03 '21 20:11 sevein

Hmm, the new code doesn't seem to work for me on a pyoxidized Python 3.9:

  File "jsonschema", line 29, in <module>
  File "jsonschema.validators", line 349, in <module>
  File "jsonschema._utils", line 60, in load_schema
  File "importlib.resources", line 147, in files
  File "importlib._common", line 14, in from_package
  File "importlib._common", line 18, in fallback_resources
  File "pathlib", line 1082, in __new__
  File "pathlib", line 707, in _from_parts
  File "pathlib", line 691, in _parse_args
TypeError: expected str, bytes or os.PathLike object, not NoneType

dae avatar Nov 05 '21 05:11 dae

You're right @dae, I can reproduce when the location is in-memory. It does work with jsonschema v4.2.0 when using filesystem-relative.

sevein avatar Nov 05 '21 10:11 sevein

The new code breaks due to https://github.com/indygreg/PyOxidizer/issues/237. It's doing exactly that:

It is possible some code somewhere (including in the Python standard library) is buggy and assumes presence of a 3.9 API (e.g. importlib.resources.files() implies support for these APIs on individual meta path importers. I could easily see how someone could test for hasattr(importlib.resources, "files") and assume all meta path importers support .files(). This behavior would be naive about the possibility of 3rd party meta path importers, such as OxidizedFinder.

pquentin avatar Nov 29 '22 11:11 pquentin