rules_python icon indicating copy to clipboard operation
rules_python copied to clipboard

pipy dependency generated py_library ignores .pyc files in installed package.

Open qingyouzhao opened this issue 1 year ago • 5 comments

🐞 bug report

Affected Rule

https://github.com/bazelbuild/rules_python/blob/main/docs/pypi-dependencies.md Specifically py_library with name = "pkg" generated by

pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
pip.parse(
    hub_name = "my_deps",
    python_version = "3.11",
    requirements_lock = "//:requirements_lock_3_11.txt",
)
use_repo(pip, "my_deps")

Is this a regression?

I don't know.

Description

I couldn't import a module from pip package that ships python modules with .pyi and .pyc files. For example, I have a example_package that ships with the following file structure

/example_package
  __init__.py
  foo.pyc
  foo.pyi

According to https://peps.python.org/pep-3147/#flow-chart,

import foo from example_package

Should be successful. But the py_library generated specifically excludes "**/*.pyc" from data. If I manually remove the "**/*.pyc" exclusion, the import statement works.

py_library(
    name = "pkg",
    srcs = glob(
        ["site-packages/**/*.py"],
        exclude=[],
        # Empty sources are allowed to support wheels that don't have any
        # pure-Python code, e.g. pymssql, which is written in Cython.
        allow_empty = True,
    ),
    data = [] + glob(
        ["site-packages/**/*"],
        exclude=["**/* *", "**/*.py", "**/*.pyc", "**/*.pyc.*", "**/*.dist-info/RECORD"],
    ),
    # This makes this directory a top-level in the python import
    # search path for anything that depends on this.
    imports = ["site-packages"],
    deps = [],
    tags = ["pypi_name=example_package", "pypi_version=3.21.0"],
    visibility = ["//visibility:public"],
)

How

🔬 Minimal Reproduction

  1. Creating a pip package following https://packaging.python.org/en/latest/ but replace the foo.py with compiled foo.pyc file.
  2. Add pip dependency to a bazel workspace following guides in https://github.com/bazelbuild/rules_python/blob/main/docs/pypi-dependencies.md
  3. In a py_library or py_binary, import foo from example_package should work but errors out with module not found error.

🔥 Exception or Error

ImportError: cannot import name 'foo' from 'example_package'

🌍 Your Environment

Operating System:

Linux

Output of bazel version:

  
Bazelisk version: v1.20.0
Build label: 7.0.2
Build target: @@//src/main/java/com/google/devtools/build/lib/bazel:BazelServer
Build time: Thu Jan 25 16:13:35 2024 (1706199215)
Build timestamp: 1706199215
Build timestamp as int: 1706199215
  

Rules_python version:

  
bazel_dep(name = "rules_python", version = "0.31.0")
  

Anything else relevant?

qingyouzhao avatar Sep 04 '24 16:09 qingyouzhao

This bug was mentioned in a recent PR, so I'll repost what I said there.

The basic gist is that pyc files have been problematic for build determinism and correctness.

Usually pyc files are created by the runtime when they are first imported. This is problematic for Bazel in two ways:

  1. They use timestamp based pyc files. These are problematic because timestamps are non-deterministic, i.e. the pyc content changes even though the source file hasn't changed. Looking through the Python docs, I can't find any way to change the default pyc invalidation mode.
  2. Because they are created upon first import of a module, race conditions can occur, which results in errors. The errors I've seen thus far are
  • Multiple Python processes will attempt to read/write/delete the same pyc file at the same time (under the hood, the way pyc creation works is a foo.pyc.$TIMESTAMP file is created, then atomically moved to its final location). While that is happening, another Python process, or Bazel process can be reading the file. On windows, this results in an error because an open file is deleted
  • Because the set of files is changing, glob() can see a file, the file gets deleted, and then bazel gives an error about the file not being present.

All that said, this behavior is somewhat surprising and perplexing. By default, individual files are symlinked, and they're in a sandbox, so it's not clear how pycs were getting created in the underlying repo directory. But, all the CI issues went away after ignoring pyc files.

rickeylev avatar Oct 11 '24 18:10 rickeylev

@rickeylev Thanks for the explanation. I might not fully grasp the details but I get your point on

  1. Runtime created pyc files are problematic
  2. Ignoring pyc files solves all the CI issues

What I am curious is:

  1. Can bazel differentiate runtime created pyc files vs package installed pyc files?
  2. Can we only ignore *.pyc.* to handle foo.pyc.$TIMESTAMP cases while keeping foo.pyc included?

qingyouzhao avatar Oct 11 '24 20:10 qingyouzhao

Can bazel differentiate runtime created pyc files vs package installed pyc files?

Not really. Both cases are "source" files. So to Bazel, whether it was installed from the package, created by a user, or created by some other process, they look the same.

The closest we're able to do is prevent the interpreter from being able to write the files, e.g. by making everything read-only. Windows strikes here again, though, because it's security model allows an admin user to ignore such read-only attributes.

Alternatively, we could force the interpreter to use DONTWRITEBYTECODE. That might be doable (setting interpreter flags/env has been trickier than expected; this is the sort of thing we have to try to flush out edge cases).

Can we only ignore .pyc. to handle foo.pyc.$TIMESTAMP cases while keeping foo.pyc included?

Yes, but it doesn't prevent the deleting-open-file race condition I mentioned. This only affects windows (only windows gives an error about deleting an open file). It'd be acceptable to have windows ignore pyc, I suppose; our windows support is borderline and best-effort. The main thing is preventing the build errors that occur, where the only thing you can do is re-run bazel and hope things interweave successfully.

rickeylev avatar Oct 11 '24 21:10 rickeylev

Also, to clarify:

All that said, this behavior is somewhat surprising and perplexing. By default, individual files are symlinked, and they're in a sandbox, so it's not clear how pycs were getting created in the underlying repo directory.

Part of me suspects this undesirable behavior is limited to (a) just the interpreter and its stdlib, and/or (b) local, non-sandboxed execution of some sort. That's just a theory -- if it could be shown that theory is correct, maybe we could think of some more solutions.

Two more thoughts.

Precompiling is a builtin feature of the rules now. By setting precompile="enabled" on the targets, it'll make Bazel perform its own precopmilation of py sources. We should change the generated pypi code to enable precompiling; we just haven't gotten aroudn to trying that yet.

If the pypi package is a pyc-only package (rare, but supported), then I think it's reasonble for py_library to accept .pyc files directly in srcs. A pyc-only library wouldn't have any of these issues, as there isn't any py source file to try and generate a new pyc file from.

rickeylev avatar Oct 11 '24 21:10 rickeylev

The main thing is preventing the build errors that occur

In my use case, I am okay with the build errors because the package that I am concerned with is a pyc-only library with pyi files for tooling. I am happy with any special case solution that just works with pure pyc packages.

qingyouzhao avatar Oct 11 '24 21:10 qingyouzhao

I think this issue could be something that outside contributors could come and help out - the whl_library_targets macro is generating the py_library targets that will be consumed by the downstream. There are a few options here:

  • Focus on a special whl_mods based solution, which modifies the included/excluded srcs or something similar.
  • Another approach could be to look at what is in the RECORD file and then include that as part of the py_library. I think I would prefer the second approach - to include the .pyc files if they are in the RECORD file because that means that they are a part of the whl distribution.

Both of the solutions should be first gated by an env variable or a feature flag so that we can roll it out slowly.

aignas avatar Nov 13 '24 08:11 aignas