pip
pip copied to clipboard
Ensure 'pip wheel' can create .so artifacts deterministically
What's the problem this feature will solve?
The Bazel build system has the major selling point of supporting both local and remote-caching.
In order for that caching to work though, Bazel targets must be built deterministically so that the same target always has the same content-addressable hash.
Currently pip wheel is non-deterministic, so our Python Bazel targets will cache miss if they depend on something built with pip wheel.
Describe the solution you'd like
Note: The following is the output of a Bazel execution log. A bit unrelated to the
pip wheelcommand but shows the relevant information.
inputs {
path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/LICENSE"
digest {
hash: "a2adb9c959b797494a0ef80bdf60e22db2749ee3e0c0908556e3eb548f967c56"
size_bytes: 1101
hash_function_name: "SHA-256"
}
}
inputs {
path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/METADATA"
digest {
hash: "df7bc0c7cbd2ce350c5c61ceda3a74bbcb6f82446a7c01f7f8e1034a98df231f"
size_bytes: 1704
hash_function_name: "SHA-256"
}
}
inputs {
path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/RECORD"
digest {
hash: "6fe803b74ab4fcab1f23e96060cf062d12779598af7e72692c492c2dd7cad0ed"
size_bytes: 1701
hash_function_name: "SHA-256"
}
}
inputs {
path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/WHEEL"
digest {
hash: "cdf2c8f141bc498ae490a88870d655dd174abe3db8c1f57562224b168930c624"
size_bytes: 104
hash_function_name: "SHA-256"
}
}
inputs {
path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/top_level.txt"
digest {
hash: "ae98f42153138ac02387fd6f1b709c7fdbf98e9090c00cfa703d48554e597614"
size_bytes: 11
hash_function_name: "SHA-256"
}
}
inputs {
path: "external/pypi__PyYAML_5_1/_yaml.cpython-36m-x86_64-linux-gnu.so"
digest {
hash: "a7f3774015f839ccee5e2281bbfdf22a42e0e1dafaac33ef4c91db83a07210d9"
size_bytes: 1133288
hash_function_name: "SHA-256"
}
}
inputs {
path: "external/pypi__PyYAML_5_1/yaml/__init__.py"
digest {
hash: "2af8b6dbcb1df5c63597f215421cad02f2317e291061b181b0f7bbf4f71ac0dd"
size_bytes: 12012
hash_function_name: "SHA-256"
}
}
The following is a subset of the build outputs of the PyYAML package. Of the build outputs, it is the RECORD files and the _yaml.cpython-36m-x86_64-linux-gnu.so shared object file that have non-deterministic hashes build to build. I have inspected the RECORD file and found that it contains the hash of the .so file, so it is non-deterministic because of the .so file, and I think only because of that.
So the problem is the .so file.
I ran the strings program on the .so file and found this printable string: /tmp/pip-wheel-_bd8v3f2/pyyaml. That is coming from here:
https://github.com/pypa/pip/blob/6af9de97bbd2427f82942e476c590a2db22ea1ff/src/pip/_internal/wheel.py#L649
So while I found other differences between different _yaml.cpython-36m-x86_64-linux-gnu.so, this tmp directory usage leaking in itself is sufficient to break determinism.
Additional context
rules_python issue discussing this problem: https://github.com/bazelbuild/rules_python/issues/154
rules_python repo: https://github.com/bazelbuild/rules_python
This string is embedded in the debug information:
$ objdump -Wi _yaml.cpython-37m-x86_64-linux-gnu.so | grep /tmp
<15> DW_AT_comp_dir : (indirect string, offset: 0x76b): /tmp/user/1000/pip-install-94g5rob6/pyyaml
It looks like this is a common issue:
- golang/go#16860
- https://tests.reproducible-builds.org/debian/issues/unstable/gcc_captures_build_path_issue.html
It may also be related to pypa/wheel#248.
Using a deterministic build directory name would leave us open to denial-of-service and/or concurrency-related problems.
Using e.g. the GCC flag -fdebug-prefix-map=$SRC_ROOT=. is another option, but is this something that should be in pip, wheel, or the individual build backends?
Thanks for your input Christopher.
would leave us open to denial-of-service and/or concurrency-related problems.
The latter makes sense to me, but how exactly is DOS at play if you use a consistent build dir?
should be in pip, wheel, or the individual build backends?
For my use case, the input of the Bazel / rules_python team would be helpful in deciding. They haven't responded to my issue in rules_python yet. Might be time for a nudge.
The latter makes sense to me, but who exactly is DOS at play if you use a consistent build dir?
An unprivileged user on the same host can create the build directory and set its permissions to 700, which prevents you from building.
In a simple test, I was able to get consistent builds by exporting CFLAGS=-g0 before building the wheel. This prevents adding any of the debug information to the generated libraries which is where the TempDirectory was being pulled in. I also have SOURCE_DATE_EPOCH set. I don't know how universal this is (and, of course, you lose debugging symbols).
This is an issue not just with pip wheel, but pip install in general when not run in editable mode :(.
CC: @bdrewery
Interesting... this is a common pattern that can't be overridden:
$ grep -rI TempDirectory
pip/_internal/build_env.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/build_env.py: self._temp_dir = TempDirectory(kind="build-env")
pip/_internal/cache.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/cache.py: self._temp_dir = TempDirectory(kind="ephem-wheel-cache")
pip/_internal/download.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/download.py: with TempDirectory(kind="unpack") as temp_dir:
pip/_internal/wheel.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/wheel.py: with TempDirectory(kind="wheel") as temp_dir:
pip/_internal/commands/download.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/commands/download.py: with RequirementTracker() as req_tracker, TempDirectory(
pip/_internal/commands/install.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/commands/install.py: target_temp_dir = TempDirectory(kind="target")
pip/_internal/commands/install.py: with RequirementTracker() as req_tracker, TempDirectory(
pip/_internal/commands/wheel.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/commands/wheel.py: with RequirementTracker() as req_tracker, TempDirectory(
pip/_internal/req/req_install.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/req/req_install.py: self._temp_build_dir = TempDirectory(kind="req-build")
pip/_internal/req/req_install.py: with TempDirectory(kind="record") as temp_dir:
pip/_internal/req/req_tracker.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/req/req_tracker.py: self._temp_dir = TempDirectory(delete=False, kind='req-tracker')
pip/_internal/req/req_uninstall.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/req/req_uninstall.py: self.save_dir = TempDirectory(kind="uninstall")
pip/_internal/utils/temp_dir.py:class TempDirectory(object):
pip/_internal/utils/temp_dir.py: super(TempDirectory, self).__init__()
pip/_internal/vcs/bazaar.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/vcs/bazaar.py: with TempDirectory(kind="export") as temp_dir:
pip/_internal/vcs/git.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/vcs/git.py: with TempDirectory(kind="export") as temp_dir:
pip/_internal/vcs/mercurial.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/vcs/mercurial.py: with TempDirectory(kind="export") as temp_dir:
From pip.util.TempDirectory:
60 def create(self):
61 """Create a temporary directory and store it's path in self.path
62 """
63 if self.path is not None:
64 logger.debug(
65 "Skipped creation of temporary directory: {}".format(self.path)
66 )
67 return
68 # We realpath here because some systems have their default tmpdir
69 # symlinked to another directory. This tends to confuse build
70 # scripts, so we canonicalize the path by traversing potential
71 # symlinks here.
72 self.path = os.path.realpath(
73 tempfile.mkdtemp(prefix="pip-{}-".format(self.kind))
74 )
75 logger.debug("Created temporary directory: {}".format(self.path))
It's unfortunate, but it looks like the mkdtemp wrapper lacks the ability to override the template, which would allow us to set a deterministic build path:
$ man 3 mktemp
...
char *
mkdtemp(char *template);
Sidenote: hmmm... the notion of tempfiles in cpython as of 3.9 seems insecure; it bypasses better security practices by reimplementing mkstemp(3), etc, with a deterministic algorithm/template consisting of only 8 "random" characters seeded by the PID :(.
I've fallen back to invoking setup.py directly for building binaries and am using pip only when installing the binaries.
It's really unfortunate that there isn't a better story around this with pip, given that pip is sort of the defacto core installation utility for python. Hopefully the Chan-Zuckerberg work will improve this support/usability, especially since Facebook used buck (alternative to bazel) by default, which does similar tricks in terms of caching artifacts on remote shares to speed up their build process.
Is this another argument in favor of in-tree builds (#7555)?
It is slightly different, since building in the source tree does not necessarily mean the built artifacts are in the source tree. It is only by tradition the most popular back-end (setuptools) does this. Having in-tree builds would happen to solve the immediate problem, but IMO the ultimate solution to this problem would be to introduce a flag to PEP 517 that can tell the back-end where they must generate the artifact in, and create a flag in pip to let user provide that information.
Another piece of software that breaks because of non-deterministic paths is sccache, which requires paths to match in order to get a cache hit. Also, the DOS thing probably isn't important when builds are containerized.
So... pip no longer performs local builds in a non-deterministic location, so if users are seeing non-deterministic build outputs, it's likely not pip but instead the build-backend that the package being built is using.
Hmm, but in practical terms that means the issue still occurs when e. g. installing packages / building wheels from git URLs
EDIT: see below
pip needs to download source distributions (from VCS or archives) to temporary directories because it may not even know their name in advance, and when the name is known, it may need to download and prepare metadata for different versions of the same project during the resolution process.
One thing we could imagine is moving/renaming the temporary unpack directory to a predictable location (say $TMPDIR/build/{canonical-name}) before the build step. EDIT: this is a simplistic approach that is not implementable as is at it opens the door to concurrency issues.
@vlad-ivanov-name I don't think the code you highlight above is relevant because it is merely the target directory where the built wheel must be stored, which should not be relevant to the content of the wheel.
That makes sense, thank you.
For the purpose of CI caching, where paths being deterministic 90% of the time is good enough already, I'm considering monkey-patching tempfile.mkdtemp to use a deterministic seed instead of PID, and then invoking pip via runpy. This might fail if downloads are multi-threaded though but it doesn't seem to be the case at the moment.
One thing we could imagine is moving/renaming the temporary unpack directory to a predictable location (say $TMPDIR/build/{canonical-name}) before the build step.
I think predictable and deterministic are a bit different, again for caching having deterministic paths is enough even if the way of deriving those paths is convoluted
Is this not a problem for the build backend? I'm struggling to see why the problem here isn't that the backend embeds the full pathname into the output, rather than just a relative name.
Yes this should first be addressed in build backends and now that --config-settings is implemented all means to provide necessary options are available.
Well... At the end of the day, while building things in a reproducible manner is a valueable activity, pip is not a tool that can enforce that right now. Note that the wheels are built by a "build backend" such as https://github.com/pypa/setuptools/ or https://github.com/pypa/flit/ or https://github.com/pypa/hatch. All that pip is doing is calling them and copy-pasting their artifacts over.
Basically, I don't think the guarentee that's being requested here is something that pip itself can provide, on its own anyway. There's tooling available to build wheels in a reproducible manner, like https://github.com/kushaldas/asaman -- which uses pip under the hood and sets up everything for the relevant build-backends to build things in a reproducible manner (assuming they follow https://reproducible-builds.org/ model).
Basically, I don't think the guarentee that's being requested here is something that pip itself can provide, on its own anyway.
That is fair; certainly, pip alone won't be able to manage all possible caveats (at the end of the day one could always put __DATE__ in source code).
However, I certainly could see pip assisting with it to some degree, for example, by providing a way to change the behaviour of the TempDirectory class to control download locations during pip install. No complicated API is necessary but something as simple as a global counter instead of "random" name would work.
I do understand that to some degree, it is the problem of build systems, cache wrappers etc; but in some cases, those have no choice but to depend on absolute paths as the mechanism of detecting whether the path would affect the output would be too complicated and unreliable to implement (example: https://github.com/mozilla/sccache/issues/35).
asaman, as a wrapper for pip, could work but unfortunately doesn't support SCM URLs
In a simple test, I was able to get consistent builds by exporting
CFLAGS=-g0before building the wheel. This prevents adding any of the debug information to the generated libraries which is where the TempDirectory was being pulled in. I also haveSOURCE_DATE_EPOCHset. I don't know how universal this is (and, of course, you lose debugging symbols).
This worked for me on Linux for all packages encountered, but only for some on macOS. In particular, it seems packages that used Cython as part of their setup.py still had some debug symbols when built on macOS.
I am atm only interested in the pip install case (as opposed to pip wheel).
Running strip -x on the build artifacts (after pip install finished) worked.
E.g.
$ find "$VENV_DIR/lib/python$PY_VER_MAJOR/site-packages/" -type f -name '*.so' -print0 \
| xargs -0 -t strip -x
Given the state of the ecosystem and the devolution of build behaviours, I'm gonna close this out and say that you should help improve asaman if you want this.
Linking this here for people looking for possible solutions: https://discuss.python.org/t/introducing-asaman-a-tool-to-bulid-reproducible-wheels/10932