pip icon indicating copy to clipboard operation
pip copied to clipboard

Ensure 'pip wheel' can create .so artifacts deterministically

Open thundergolfer opened this issue 6 years ago • 19 comments

What's the problem this feature will solve?

The Bazel build system has the major selling point of supporting both local and remote-caching.

In order for that caching to work though, Bazel targets must be built deterministically so that the same target always has the same content-addressable hash.

Currently pip wheel is non-deterministic, so our Python Bazel targets will cache miss if they depend on something built with pip wheel.

Describe the solution you'd like

Note: The following is the output of a Bazel execution log. A bit unrelated to the pip wheel command but shows the relevant information.

inputs {
  path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/LICENSE"
  digest {
    hash: "a2adb9c959b797494a0ef80bdf60e22db2749ee3e0c0908556e3eb548f967c56"
    size_bytes: 1101
    hash_function_name: "SHA-256"
  }
}
inputs {
  path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/METADATA"
  digest {
    hash: "df7bc0c7cbd2ce350c5c61ceda3a74bbcb6f82446a7c01f7f8e1034a98df231f"
    size_bytes: 1704
    hash_function_name: "SHA-256"
  }
}
inputs {
  path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/RECORD"
  digest {
    hash: "6fe803b74ab4fcab1f23e96060cf062d12779598af7e72692c492c2dd7cad0ed"
    size_bytes: 1701
    hash_function_name: "SHA-256"
  }
}
inputs {
  path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/WHEEL"
  digest {
    hash: "cdf2c8f141bc498ae490a88870d655dd174abe3db8c1f57562224b168930c624"
    size_bytes: 104
    hash_function_name: "SHA-256"
  }
}
inputs {
  path: "external/pypi__PyYAML_5_1/PyYAML-5.1.dist-info/top_level.txt"
  digest {
    hash: "ae98f42153138ac02387fd6f1b709c7fdbf98e9090c00cfa703d48554e597614"
    size_bytes: 11
    hash_function_name: "SHA-256"
  }
}
inputs {
  path: "external/pypi__PyYAML_5_1/_yaml.cpython-36m-x86_64-linux-gnu.so"
  digest {
    hash: "a7f3774015f839ccee5e2281bbfdf22a42e0e1dafaac33ef4c91db83a07210d9"
    size_bytes: 1133288
    hash_function_name: "SHA-256"
  }
}
inputs {
  path: "external/pypi__PyYAML_5_1/yaml/__init__.py"
  digest {
    hash: "2af8b6dbcb1df5c63597f215421cad02f2317e291061b181b0f7bbf4f71ac0dd"
    size_bytes: 12012
    hash_function_name: "SHA-256"
  }
}

The following is a subset of the build outputs of the PyYAML package. Of the build outputs, it is the RECORD files and the _yaml.cpython-36m-x86_64-linux-gnu.so shared object file that have non-deterministic hashes build to build. I have inspected the RECORD file and found that it contains the hash of the .so file, so it is non-deterministic because of the .so file, and I think only because of that.

So the problem is the .so file.

I ran the strings program on the .so file and found this printable string: /tmp/pip-wheel-_bd8v3f2/pyyaml. That is coming from here:

https://github.com/pypa/pip/blob/6af9de97bbd2427f82942e476c590a2db22ea1ff/src/pip/_internal/wheel.py#L649

So while I found other differences between different _yaml.cpython-36m-x86_64-linux-gnu.so, this tmp directory usage leaking in itself is sufficient to break determinism.

Additional context

rules_python issue discussing this problem: https://github.com/bazelbuild/rules_python/issues/154 rules_python repo: https://github.com/bazelbuild/rules_python

thundergolfer avatar May 15 '19 01:05 thundergolfer

This string is embedded in the debug information:

$ objdump -Wi _yaml.cpython-37m-x86_64-linux-gnu.so  | grep /tmp
    <15>   DW_AT_comp_dir    : (indirect string, offset: 0x76b): /tmp/user/1000/pip-install-94g5rob6/pyyaml

It looks like this is a common issue:

  • golang/go#16860
  • https://tests.reproducible-builds.org/debian/issues/unstable/gcc_captures_build_path_issue.html

It may also be related to pypa/wheel#248.

Using a deterministic build directory name would leave us open to denial-of-service and/or concurrency-related problems.

Using e.g. the GCC flag -fdebug-prefix-map=$SRC_ROOT=. is another option, but is this something that should be in pip, wheel, or the individual build backends?

chrahunt avatar Jun 18 '19 04:06 chrahunt

Thanks for your input Christopher.

would leave us open to denial-of-service and/or concurrency-related problems.

The latter makes sense to me, but how exactly is DOS at play if you use a consistent build dir?

should be in pip, wheel, or the individual build backends?

For my use case, the input of the Bazel / rules_python team would be helpful in deciding. They haven't responded to my issue in rules_python yet. Might be time for a nudge.

thundergolfer avatar Jun 18 '19 05:06 thundergolfer

The latter makes sense to me, but who exactly is DOS at play if you use a consistent build dir?

An unprivileged user on the same host can create the build directory and set its permissions to 700, which prevents you from building.

chrahunt avatar Jun 18 '19 08:06 chrahunt

In a simple test, I was able to get consistent builds by exporting CFLAGS=-g0 before building the wheel. This prevents adding any of the debug information to the generated libraries which is where the TempDirectory was being pulled in. I also have SOURCE_DATE_EPOCH set. I don't know how universal this is (and, of course, you lose debugging symbols).

manthey avatar Aug 30 '19 14:08 manthey

This is an issue not just with pip wheel, but pip install in general when not run in editable mode :(.

CC: @bdrewery

ngie-eign avatar Jul 15 '20 21:07 ngie-eign

Interesting... this is a common pattern that can't be overridden:

$ grep -rI TempDirectory  
pip/_internal/build_env.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/build_env.py:        self._temp_dir = TempDirectory(kind="build-env")
pip/_internal/cache.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/cache.py:        self._temp_dir = TempDirectory(kind="ephem-wheel-cache")
pip/_internal/download.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/download.py:    with TempDirectory(kind="unpack") as temp_dir:
pip/_internal/wheel.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/wheel.py:        with TempDirectory(kind="wheel") as temp_dir:
pip/_internal/commands/download.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/commands/download.py:            with RequirementTracker() as req_tracker, TempDirectory(
pip/_internal/commands/install.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/commands/install.py:        target_temp_dir = TempDirectory(kind="target")
pip/_internal/commands/install.py:            with RequirementTracker() as req_tracker, TempDirectory(
pip/_internal/commands/wheel.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/commands/wheel.py:            with RequirementTracker() as req_tracker, TempDirectory(
pip/_internal/req/req_install.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/req/req_install.py:        self._temp_build_dir = TempDirectory(kind="req-build")
pip/_internal/req/req_install.py:        with TempDirectory(kind="record") as temp_dir:
pip/_internal/req/req_tracker.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/req/req_tracker.py:            self._temp_dir = TempDirectory(delete=False, kind='req-tracker')
pip/_internal/req/req_uninstall.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/req/req_uninstall.py:        self.save_dir = TempDirectory(kind="uninstall")
pip/_internal/utils/temp_dir.py:class TempDirectory(object):
pip/_internal/utils/temp_dir.py:        super(TempDirectory, self).__init__()
pip/_internal/vcs/bazaar.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/vcs/bazaar.py:        with TempDirectory(kind="export") as temp_dir:
pip/_internal/vcs/git.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/vcs/git.py:        with TempDirectory(kind="export") as temp_dir:
pip/_internal/vcs/mercurial.py:from pip._internal.utils.temp_dir import TempDirectory
pip/_internal/vcs/mercurial.py:        with TempDirectory(kind="export") as temp_dir:

From pip.util.TempDirectory:

     60     def create(self):
     61         """Create a temporary directory and store it's path in self.path
     62         """
     63         if self.path is not None:
     64             logger.debug(
     65                 "Skipped creation of temporary directory: {}".format(self.path)
     66             )
     67             return
     68         # We realpath here because some systems have their default tmpdir
     69         # symlinked to another directory.  This tends to confuse build
     70         # scripts, so we canonicalize the path by traversing potential
     71         # symlinks here.
     72         self.path = os.path.realpath(
     73             tempfile.mkdtemp(prefix="pip-{}-".format(self.kind))
     74         )
     75         logger.debug("Created temporary directory: {}".format(self.path))

It's unfortunate, but it looks like the mkdtemp wrapper lacks the ability to override the template, which would allow us to set a deterministic build path:

$ man 3 mktemp
...
     char *
     mkdtemp(char *template);

Sidenote: hmmm... the notion of tempfiles in cpython as of 3.9 seems insecure; it bypasses better security practices by reimplementing mkstemp(3), etc, with a deterministic algorithm/template consisting of only 8 "random" characters seeded by the PID :(.

ngie-eign avatar Jul 15 '20 23:07 ngie-eign

I've fallen back to invoking setup.py directly for building binaries and am using pip only when installing the binaries.

It's really unfortunate that there isn't a better story around this with pip, given that pip is sort of the defacto core installation utility for python. Hopefully the Chan-Zuckerberg work will improve this support/usability, especially since Facebook used buck (alternative to bazel) by default, which does similar tricks in terms of caching artifacts on remote shares to speed up their build process.

ngie-eign avatar Jul 17 '20 04:07 ngie-eign

Is this another argument in favor of in-tree builds (#7555)?

sbidoul avatar Jul 17 '20 08:07 sbidoul

It is slightly different, since building in the source tree does not necessarily mean the built artifacts are in the source tree. It is only by tradition the most popular back-end (setuptools) does this. Having in-tree builds would happen to solve the immediate problem, but IMO the ultimate solution to this problem would be to introduce a flag to PEP 517 that can tell the back-end where they must generate the artifact in, and create a flag in pip to let user provide that information.

uranusjr avatar Jul 17 '20 11:07 uranusjr

Another piece of software that breaks because of non-deterministic paths is sccache, which requires paths to match in order to get a cache hit. Also, the DOS thing probably isn't important when builds are containerized.

vlad-ivanov-name avatar May 13 '22 16:05 vlad-ivanov-name

So... pip no longer performs local builds in a non-deterministic location, so if users are seeing non-deterministic build outputs, it's likely not pip but instead the build-backend that the package being built is using.

pradyunsg avatar May 13 '22 17:05 pradyunsg

Hmm, but in practical terms that means the issue still occurs when e. g. installing packages / building wheels from git URLs

EDIT: see below

vlad-ivanov-name avatar May 14 '22 09:05 vlad-ivanov-name

pip needs to download source distributions (from VCS or archives) to temporary directories because it may not even know their name in advance, and when the name is known, it may need to download and prepare metadata for different versions of the same project during the resolution process.

One thing we could imagine is moving/renaming the temporary unpack directory to a predictable location (say $TMPDIR/build/{canonical-name}) before the build step. EDIT: this is a simplistic approach that is not implementable as is at it opens the door to concurrency issues.

@vlad-ivanov-name I don't think the code you highlight above is relevant because it is merely the target directory where the built wheel must be stored, which should not be relevant to the content of the wheel.

sbidoul avatar May 14 '22 10:05 sbidoul

That makes sense, thank you.

For the purpose of CI caching, where paths being deterministic 90% of the time is good enough already, I'm considering monkey-patching tempfile.mkdtemp to use a deterministic seed instead of PID, and then invoking pip via runpy. This might fail if downloads are multi-threaded though but it doesn't seem to be the case at the moment.

One thing we could imagine is moving/renaming the temporary unpack directory to a predictable location (say $TMPDIR/build/{canonical-name}) before the build step.

I think predictable and deterministic are a bit different, again for caching having deterministic paths is enough even if the way of deriving those paths is convoluted

vlad-ivanov-name avatar May 14 '22 10:05 vlad-ivanov-name

Is this not a problem for the build backend? I'm struggling to see why the problem here isn't that the backend embeds the full pathname into the output, rather than just a relative name.

pfmoore avatar May 14 '22 10:05 pfmoore

Yes this should first be addressed in build backends and now that --config-settings is implemented all means to provide necessary options are available.

sbidoul avatar May 14 '22 10:05 sbidoul

Well... At the end of the day, while building things in a reproducible manner is a valueable activity, pip is not a tool that can enforce that right now. Note that the wheels are built by a "build backend" such as https://github.com/pypa/setuptools/ or https://github.com/pypa/flit/ or https://github.com/pypa/hatch. All that pip is doing is calling them and copy-pasting their artifacts over.

Basically, I don't think the guarentee that's being requested here is something that pip itself can provide, on its own anyway. There's tooling available to build wheels in a reproducible manner, like https://github.com/kushaldas/asaman -- which uses pip under the hood and sets up everything for the relevant build-backends to build things in a reproducible manner (assuming they follow https://reproducible-builds.org/ model).

pradyunsg avatar May 14 '22 10:05 pradyunsg

Basically, I don't think the guarentee that's being requested here is something that pip itself can provide, on its own anyway.

That is fair; certainly, pip alone won't be able to manage all possible caveats (at the end of the day one could always put __DATE__ in source code).

However, I certainly could see pip assisting with it to some degree, for example, by providing a way to change the behaviour of the TempDirectory class to control download locations during pip install. No complicated API is necessary but something as simple as a global counter instead of "random" name would work.

I do understand that to some degree, it is the problem of build systems, cache wrappers etc; but in some cases, those have no choice but to depend on absolute paths as the mechanism of detecting whether the path would affect the output would be too complicated and unreliable to implement (example: https://github.com/mozilla/sccache/issues/35).

asaman, as a wrapper for pip, could work but unfortunately doesn't support SCM URLs

vlad-ivanov-name avatar May 14 '22 14:05 vlad-ivanov-name

In a simple test, I was able to get consistent builds by exporting CFLAGS=-g0 before building the wheel. This prevents adding any of the debug information to the generated libraries which is where the TempDirectory was being pulled in. I also have SOURCE_DATE_EPOCH set. I don't know how universal this is (and, of course, you lose debugging symbols).

This worked for me on Linux for all packages encountered, but only for some on macOS. In particular, it seems packages that used Cython as part of their setup.py still had some debug symbols when built on macOS.

I am atm only interested in the pip install case (as opposed to pip wheel). Running strip -x on the build artifacts (after pip install finished) worked.

E.g.

$ find "$VENV_DIR/lib/python$PY_VER_MAJOR/site-packages/" -type f -name '*.so' -print0 \
    | xargs -0 -t strip -x

SomberNight avatar Aug 06 '22 06:08 SomberNight

Given the state of the ecosystem and the devolution of build behaviours, I'm gonna close this out and say that you should help improve asaman if you want this.

pradyunsg avatar Mar 14 '23 10:03 pradyunsg

Linking this here for people looking for possible solutions: https://discuss.python.org/t/introducing-asaman-a-tool-to-bulid-reproducible-wheels/10932

uranusjr avatar Mar 14 '23 10:03 uranusjr