mpich
packaging: filter Python bytecode and tarball contents reproducibility
The autogen.sh and configure scripts run Python code to generate sources. It would be great if things were set up, ideally within autogen.sh and configure themselves, so that the following two environment variables are set before any Python script runs:
PYTHONDONTWRITEBYTECODE=1
PYTHONHASHSEED=0
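To illustrate the effect of the two variables, here is a minimal sketch of a wrapper that launches a child interpreter with both set. The names REPRO_ENV and run_python are hypothetical and do not exist in the MPICH tree; in autogen.sh itself the equivalent would simply be exporting the two variables before any Python script is invoked.

```python
import os
import subprocess
import sys

# Hypothetical helper, not part of MPICH: the two variables proposed above.
REPRO_ENV = {"PYTHONDONTWRITEBYTECODE": "1", "PYTHONHASHSEED": "0"}


def run_python(args):
    """Run a child Python interpreter with the reproducibility variables set."""
    env = dict(os.environ)
    env.update(REPRO_ENV)
    return subprocess.run(
        [sys.executable, *args],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # The child reports whether bytecode writing and hash randomization are on.
    probe = "import sys; print(sys.dont_write_bytecode, sys.flags.hash_randomization)"
    print(run_python(["-c", probe]).stdout.strip())
```

The probe reads sys.dont_write_bytecode and sys.flags.hash_randomization, so it is a quick way to confirm that a build script really passes the variables down to its Python children.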
Setting PYTHONDONTWRITEBYTECODE=1 prevents the generation of precompiled Python bytecode in __pycache__ directories. Currently, all recent release tarballs in the v4.x series have these unneeded __pycache__ directories.
$ find mpich-4.3.0 -name __pycache__
mpich-4.3.0/maint/local_python/__pycache__
mpich-4.3.0/modules/yaksa/maint/__pycache__
mpich-4.3.0/modules/yaksa/src/backend/__pycache__
Moreover, to prevent this issue from ever popping up again via other means, I would recommend that the tarball generation script explicitly filter out or remove these __pycache__ directories before creating release tarballs.
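As a rough sketch of that filtering step, something like the following could be run on the staged source tree right before it is archived. The function name remove_pycache is hypothetical, and I am not assuming anything about how the actual release script is structured.

```python
import shutil
from pathlib import Path


def remove_pycache(root):
    """Delete every __pycache__ directory under root and return the removed paths."""
    removed = []
    # Collect and sort the matches first so nothing is deleted while walking.
    for path in sorted(Path(root).rglob("__pycache__")):
        # A nested match may already be gone if its parent was removed.
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(path)
    return removed


if __name__ == "__main__":
    import sys

    for path in remove_pycache(sys.argv[1] if len(sys.argv) > 1 else "."):
        print("removed", path)
```

Run against an unpacked mpich-4.3.0 tree, this should report the three directories listed above.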
Setting PYTHONHASHSEED=0 disables hash randomization, which helps with output reproducibility. For example, if you run the genpup.py scripts in yaksa twice in a row, the generated C code differs between runs. This is because of how yaksa/src/backend/gencomm.py defines the type_ops variable: type_ops is (in Python typing syntax) a dict[str, set[str]]. The dict values are sets, and sets (unlike dicts since Py 3.6) do not preserve insertion order. Their iteration order depends on the hash values of the elements, and string hashes are randomized per process by default, hence the need to disable hash randomization to achieve reproducibility.
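The effect is easy to reproduce outside of yaksa. The sketch below prints the iteration order of a small set of strings from child interpreters started with a chosen PYTHONHASHSEED. The element names and the helper iteration_order are made up for illustration and are not taken from gencomm.py.

```python
import os
import subprocess
import sys

# Child program: print the iteration order of a set of strings.
SNIPPET = "print(','.join({'int', 'float', 'double', 'char', 'long'}))"


def iteration_order(hash_seed):
    """Return the set iteration order seen by a child started with this seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # A fixed seed gives the same order on every run.
    print("seed 0:     ", iteration_order(0))
    print("seed 0:     ", iteration_order(0))
    # With randomization enabled the order may change from run to run.
    print("seed random:", iteration_order("random"))
    print("seed random:", iteration_order("random"))
```

The two seed 0 lines always match. The two random lines usually differ, but they are not guaranteed to.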
IMHO, the issue with gencomm.py is better addressed by using a list rather than a set for the dict values. Even if this fix is eventually implemented, I would still suggest keeping PYTHONHASHSEED=0 to prevent similar issues from happening elsewhere.
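To make the list suggestion concrete, here is a sketch of what a dict[str, list[str]] variant could look like. The helper add_op and the type and op names are hypothetical, since I am not reproducing the actual gencomm.py code here.

```python
def add_op(type_ops, type_name, op):
    """Record op for type_name, keeping first-seen order and skipping duplicates."""
    ops = type_ops.setdefault(type_name, [])
    if op not in ops:
        ops.append(op)


# Hypothetical input; the real type and op names in yaksa differ.
type_ops = {}
for type_name, op in [("int", "sum"), ("int", "max"), ("float", "sum"), ("int", "sum")]:
    add_op(type_ops, type_name, op)

if __name__ == "__main__":
    # Both the dict keys and the list values iterate in insertion order.
    for type_name, ops in type_ops.items():
        print(type_name, ops)
```

The membership test is linear in the list length, which should not matter for a handful of ops per type. If the values have to stay sets for some reason, iterating over sorted(ops) at code generation time would give a stable order as well.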
Great insights and great suggestions! Thank you!
We definitely should filter unnecessary artifacts from release packages. Is it necessary to set PYTHONDONTWRITEBYTECODE=1? It's harmless for users who run autogen and configure themselves, right?
Is it necessary to set PYTHONDONTWRITEBYTECODE=1?
If you implement filtering, then no, it is not really necessary. However, given that you already have to set PYTHONHASHSEED=0 to fix the other issue, an extra env var is not a big deal.
It's harmless for users who run autogen and configure themselves, right?
Yes, that's correct. I'm biased: I have a personal hate for these bytecode cache files lurking around source trees.