rules/python: Option to generate .pyc files for py_library.
Description of the feature request:
Add options on one or more of the py_library rule itself, the python toolchain configuration, or the bazel command line, to generate python bytecode files as part of the build.
In addition to the basic on/off option, it might be useful to include options to
- Control the "optimization" level (for what that's worth)
- Leave the source `.py` file out of the build outputs - the `.pyc`-only use case.
What underlying problem are you trying to solve with this feature?
When python loads a module, it parses the source code to an AST, which it then attempts to cache in bytecode format. Building those as part of the build process solves a few issues:
- To quote the documentation for the `py_compile` module: "Though not often needed, this function can be useful when installing modules for shared use, especially if some of the users may not have permission to write the byte-code cache files in the directory containing the source code." Particularly in the case where bazel is being used to build a tarball of code that includes (but may not be limited to) python, and might then be deployed somewhere that's read-only to most users, it would be useful to be able to include these precompiled bytecode files.
- The attempt to compile the bytecode files would fail on syntactically-invalid python code, which is probably a good thing for catching failures earlier in the build process.
- Having `.pyc` files available improves application startup time. Especially for large python codebases, if some module is transitively imported from thousands of unit tests, currently each of those tests would end up re-parsing the python source file, which is a waste of time. Having the `.pyc` files is also helpful for improving startup times for "serverless" platforms.
- The `.pyc` files can be substantially smaller than the source files. For situations where application distribution size is important, e.g. "serverless" platforms, this can matter.
- Some people place value on the marginal degree of obfuscation and tamper resistance offered by `.pyc`-only distributions. While reverse-engineering the source from a `.pyc` file isn't hard, it's also not nothing.
Which operating system are you running Bazel on?
Linux
What is the output of bazel info release?
release 5.3.0
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
References/notes:
https://docs.python.org/3/library/py_compile.html
The default compilation mode is TIMESTAMP; this is probably a bad idea. In a bazel build we'd probably want to use UNCHECKED_HASH. Would also need to ensure that the embedded path name was appropriately relative.
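For reference, a minimal sketch of the stdlib knobs involved (the paths and the cpython-310 tag here are purely illustrative):

```python
import py_compile

# Hash-based invalidation instead of timestamps, and a workspace-relative
# path embedded in the pyc instead of an absolute one.
py_compile.compile(
    "pkg/mod.py",
    cfile="pkg/__pycache__/mod.cpython-310.pyc",
    dfile="pkg/mod.py",
    invalidation_mode=py_compile.PycInvalidationMode.UNCHECKED_HASH,
)
```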
From PEP-3147:
Python will still support pyc-only distributions, however it will only do so when the pyc file lives in the directory where the py file would have been, i.e. not in the `__pycache__` directory. pyc files outside of `__pycache__` will only be imported if the py source file is missing.
This means that in the case where the .py file is still being included, the output path would need to depend on the python interpreter version. This probably would require an attribute to be added to py_runtime for that purpose.
cc @rickeylev
This is probably fairly easy to do by adding a precompiler attribute to the toolchain that the rules can use.
The tricky part I'm not sure of is what that attribute points to. It can't point to a py_binary itself, since that would be a circular dependency. But we have to use Python to run py_compile somehow.
Maybe allow it to be an executable or a single py file? If it's an executable, run it. If it's a py file, then run "$interpreter $file". This allows creating a pre-built executable if desired, while also allowing it to Just Work.
(this again touches on the problem of how there's several cases where py-binaries want to be used as part of the toolchain)
In a bazel build we'd probably want to use UNCHECKED_HASH
Probably, yes. CHECKED_HASH is somewhat appealing, too, though, because it allows you to modify the source file and see the new behavior without having to rebuild.
I'm +1 on defaulting to one and having an option for the other. Maybe base the default on `-c opt`, so optimized builds use UNCHECKED_HASH by default, while fastbuild builds use CHECKED_HASH by default. I can easily see a server not caring about this difference, while a CLI app would.
I also like the idea of a library-level control here somehow, too. A scenario we've encountered with large packages (e.g. tensorflow) is that the sheer number of modules has its own overhead; pyc helps, but since modifying them is the exception, not the rule, making them check their hash is wasted work. We'll have to figure out how a target-level and global-flag-level option should interact, though (which has precedence?).
This means that in the case where the .py file is still being included, the output path would need to depend on the python interpreter version. This probably would require an attribute to be added to py_runtime for that purpose.
Yeah, the details of the pyc infix aren't entirely clear to me. I think PEP 488 is the other half of the puzzle. I swear I thought there was a part of the specification that included the "compile mode" of Python itself, too (e.g. debug vs non-debug), but I don't see that mentioned.
We also have to make sure that the generated name is recognized by the runtime that will be used. I think this is controlled by BYTECODE_SUFFIXES? One option is to simply stick this onto the toolchain, too. Another would be to pass the necessary suffix to the stub template so it can configure the runtime at startup.
Additionally, the output path depends on whether the original .py files are included or not.
The Major.Minor Python version is in the --python_version flag.
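For what it's worth, the suffixes the runtime's importer will accept are easy to inspect from the stdlib (a quick check, not a proposed API; output is just an example):

```python
import importlib.machinery

# What the importer will accept next to (or instead of) a .py file.
print(importlib.machinery.SOURCE_SUFFIXES)    # e.g. ['.py']
print(importlib.machinery.BYTECODE_SUFFIXES)  # e.g. ['.pyc']
```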
The tricky part I'm not sure of is what that attribute points to. It can't point to a py_binary itself, since that would be a circular dependency. But we have to use Python to run py_compile somehow.
Maybe allow it to be an executable or a single py file? If it's an executable, run it. If it's a py file, then run "$interpreter $file". This allows creating a pre-built executable if desired, while also allowing it to Just Work.
(this again touches on the problem of how there's several cases where py-binaries want to be used as part of the toolchain)
Yeah, this is kind of a similar problem as the one faced for the coverage tool in bazelbuild/bazel#15590. However, there's an important difference: py_compile is part of the python standard library. Given that, I don't really see a need to offer an option to customize the compilation tool - bazel could just hard-code running `$interpreter -m py_compile`, which as far as I can tell would work for all of CPython, PyPy, and MicroPython at least (though I've only tried with CPython).
Probably, yes. CHECKED_HASH is somewhat appealing, too, though, because it allows you to modify the source file and see the new behavior without having to rebuild.
I'm +1 on defaulting to one and having an option for the other. Maybe base the default on `-c opt`, so optimized builds use UNCHECKED_HASH by default, while fastbuild builds use CHECKED_HASH by default. I can easily see a server not caring about this difference, while a CLI app would.
Absolutely! However, that's something you can handle with select, so it doesn't really have to impact the design much.
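For illustration, a rough sketch of how that select() could look from the user side; `config_setting` and `select()` are standard Bazel, while the `compile_mode` attribute here is hypothetical:

```python
# BUILD.bazel (sketch)
config_setting(
    name = "opt_build",
    values = {"compilation_mode": "opt"},
)

py_library(
    name = "mylib",
    srcs = ["mylib.py"],
    # Hypothetical attribute; not part of py_library today.
    compile_mode = select({
        ":opt_build": "unchecked_hash",
        "//conditions:default": "checked_hash",
    }),
)
```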
I think the options we'd want exposed look something like
- No compilation
- CHECKED_HASH
- UNCHECKED_HASH
- pyc-only
As well as the mostly orthogonal choices for "optimization" level. (which I will continue to put in quotes. It's really too bad that there isn't an optimization level which strips out docstrings, which you don't need in a server context, but leaves in assertions, which might be a bit dangerous to strip out)
I also like the idea of a library-level control here somehow, too. A scenario we've encountered with large packages (e.g. tensorflow) is that the sheer number of modules has its own overhead; pyc helps, but since modifying them is the exception, not the rule, making them check their hash is wasted work. We'll have to figure out how a target-level and global-flag-level option should interact, though (which has precedence?).
This is why I suggested pre-compilation should be an attribute on the py_library rule, as well as possibly the ability to set a default at the toolchain level. For one thing, you might want to have some things be distributed as pyc-only, while others might want to include source. You might also want to be able to control the "optimization" level on a per-target basis.

I think the safe approach there would be to just say local specification takes precedence over global specification - I can't really imagine a situation where someone would set an override on a library and we'd not want to trust that they had a good reason to do so.

At a nitty-gritty API level, however, it does mean we have to be careful about the value of the attribute one sets for this; one might think that `compile_mode = None` would be a default of "use the toolchain default," but that would catch someone by surprise if they (somewhat reasonably) read that as meaning "no compilation". Easy enough to handle - just make the default for the attribute be e.g. `"from_toolchain"` or whatever.
Yeah, this is kind of a similar problem as the one faced for the coverage tool in bazelbuild/bazel#15590. However, there's an important difference:
`py_compile` is part of the python standard library. Given that, I don't really see a need to offer an option to customize the compilation tool - bazel could just hard-code running `$interpreter -m py_compile`, which as far as I can tell would work for all of CPython, PyPy, and MicroPython at least (though I've only tried with CPython).
I think invoking the interpreter directly like that is a workable intermediate solution, so I'm +1 on doing that for now. I'm fairly certain we need a custom program to ensure deterministic behavior and pass the necessary info, though. The CLI of `-m py_compile` isn't sufficient.
However, I need to make a few architectural points about why such solutions make it hard or impossible to take full advantage of Bazel's processing model and features.
As some concrete examples:
- When using embedded Python, there's no need for a standalone interpreter, so building one simply for precompilation is unnecessary work. (This is particularly salient to me because that's my primary way of building Python programs.)
- Precompilation is a near perfect fit for using persistent workers, which would eliminate runtime startup overhead. This requires special support by the executable invoked.
- Precompilation must run in the exec configuration. This means the runtime itself may be built differently than the runtime actually used, may be run remotely, and/or run on an entirely different platform. So more steps have to be taken to ensure such details don't leak into the output. For example, the launcher script does several things to prevent system settings from affecting the runtime, which have to be re-implemented.
- Similarly, the exec configuration usually emphasizes time-to-build over runtime efficiency. This makes sense for typical user build tools, but less so for something like precompilation; precompilation is well defined and its rate of invocation scales with the size of the overall build, i.e. the larger your transitive closure, the more it runs, so you're probably going to benefit from the precompiler (and runtime) being built with optimizations enabled. Such optimizations can extend beyond simply e.g. pyc -O levels, and into C-level optimizations that affect the runtime itself; these optimizations are hard to apply during a build, but are easy to apply to a prebuilt executable which is later run.
This isn't to say simply doing "$runtime -m py_compile" is bad or wrong, just that it leaves a lot sitting at the Bazel Buffet Table.
As well as the mostly orthogonal choices for "optimization" level. (which I will continue to put in quotes. It's really too bad that there isn't an optimization level which strips out docstrings, which you don't need in a server context, but leaves in assertions, which might be a bit dangerous to strip out)
Srsly, right? This has bugged me for years! It's the sort of thing a custom precompiler executable could do, though. This also makes me think, if we exposed an optimization level, it'd probably be better done as a list of values instead of a simple number.
I think the safe approach there would be to just say local specification takes precedence over global specification - I can't really imagine a situation where someone would set an override on a library and we'd not want to trust that they had a good reason to do so.
I agree. A case we have internally is that protocol buffers (which are handled by a special rule) are compiled by default. This is pretty akin to a library-level setting.
At a nitty-gritty API level, however, it does mean we have to be careful about the value of the attribute one sets for this; one might think that
`compile_mode = None` would be a default of "use the toolchain default," but that would catch someone by surprise if they (somewhat reasonably) read that as meaning "no compilation". Easy enough to handle - just make the default for the attribute be e.g. `"from_toolchain"` or whatever.
I agree.
Worth noting mostly for the sake of code archeology that there's this comment in the code currently: https://github.com/bazelbuild/bazel/blob/6d72ca979f8cf582f53452d5f905346e7effb113/src/main/java/com/google/devtools/build/lib/rules/python/PythonSemantics.java#L69-L71 however as far as I can tell the implementation https://github.com/bazelbuild/bazel/blob/9d0163002ca63f4cdbaff9420380d72e2a6e38b3/src/main/java/com/google/devtools/build/lib/bazel/rules/python/BazelPythonSemantics.java#L117-L120 simply copies the sources unaltered. But for what it's worth I'm guessing that at some point in the past someone had an implementation, or at least an attempt at one, probably Google-internal-only, for at least some of this functionality.
Thanks for filing & tracking! It would be great to have this option, particularly since the current behavior seems to make cross-machine remote caching in bazel much less effective, since each machine has its own timestamp in the pyc. So it would be much better to have a stable, safer hash instead of just timestamps.
Do you have a suggestion of a workaround for the time-being?
BTW, for others looking into this: a simple `bazel clean --expunge` and then rebuild will show many remote cache misses. Confirm it by looking at the exec logs as described in https://bazel.build/remote/cache-remote.
I would very much <3 to see this feature.
we have a lot of heavy imports :(
Some of our heavy hitter imports bog down both test & runtime (mainly test)
One of the pain points right now is that we use pytorch in our bazel monorepo. However, because we import it in so many of our libraries, it incurs a large import time overhead (3.8 seconds).
With pycache, it drops down to ~2.5 seconds.
Is the general idea to generate pyc files once when all external third party repos are fetched at analysis time and then to ingest these files as input in py_library?
Would love some pointers on where to start (even if it's just a hacky prototype 😅 )
I'm going to transfer this issue to the rules_python repo because, as of rules_python 0.31.0, we've enabled the rules_python-based implementation of the rules by default for Bazel 7+. This basically means this feature won't land in the Bazel builtin rules, but would instead be in the rules_python implementation.
Would love some pointers on where to start (even if it's just a hacky prototype 😅 )
Can do! I've had parts of this mentally sketched out, but haven't thought it through entirely. This is already partially wired into the rules_python implementation, too.
There's a stub function, maybe_precompile() (src), that gets called. As output, it returns the new set of sources a library should use, which is either (a) the py files (no precompiling), or (b) both the py files and pyc files, or (c) just the pyc files.
All this function really has to do is loop over the sources, run a tool to generate the pyc files, and return them. The important thing is that it generates deterministic output (i.e. no timestamp-based pyc files). It'll probably need some flags to control behavior a bit, as discussed, but let's focus on a prototype first. Anyways, the core code of this function is pretty simple:
```python
pycs = []
for src in src_files:
    # Generating a file in another package is an error, so we have to skip
    # such cases.
    if ctx.label.package != src.owner.package:
        continue
    pyc = ctx.actions.declare_file(..., sibling = src)
    ctx.actions.run(
        executable = ...,
        inputs = [src],
        arguments = ... + [src.path, pyc.path],
        outputs = [pyc],
        mnemonic = "PyCompile",
        progress_message = "Compiling Python %{input}",
        toolchain = ...,
    )
    pycs.append(pyc)
return pycs
```
That's the first part, which is easy.
The second part is defining the tool that is run to generate the pyc. For the purposes of experimenting, you can add an implicit attribute on py_library pointing to some executable to run. The tricky part here is the executable it points to can't be a regular py_binary -- that would result in a circular dependency. For experimenting, define it how you like to work out details (e.g. a sh_binary that just calls the system python would suffice for testing purposes).
Within Google, we prebuild a binary and use an implicit attribute. A prebuilt binary is actually a good thing in this case, but not required. It allows neat things like building a binary with optimizations, having a generic binary that can generate byte code for any python version, building a cc_binary that embeds python, or heck, you could implement a byte code compiler in Rust or whatever if you really wanted to. I'm getting ahead of myself. Anyways.
It's probably best to just start with invoking something like `python precompile.py <src> <pycout>`. That's going to be pretty close to how a non-bazel python works.
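A minimal sketch of such a `precompile.py` (a hypothetical helper, not the eventual rules_python tool), using the deterministic invalidation mode discussed above:

```python
# precompile.py -- invoked as: python precompile.py <src> <pyc_out>
import sys
import py_compile


def main() -> None:
    src, pyc_out = sys.argv[1], sys.argv[2]
    py_compile.compile(
        src,
        cfile=pyc_out,
        dfile=src,  # keep the embedded path workspace-relative
        invalidation_mode=py_compile.PycInvalidationMode.UNCHECKED_HASH,
        doraise=True,  # turn syntax errors into a failing action
    )


if __name__ == "__main__":
    main()
```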
Using an implicit attribute, however -- that won't work well in the Bazel world where the environment isn't as tightly controlled and where there are many more platform configurations to worry about.
In the open Bazel world, this precompiler tool needs to come from a toolchain. This is because we need toolchain resolution to find a tool that matches the target config (the target python version we need to generate byte code for) that can run on one of our available execution platforms. A basic implementation that re-used the same runtime would look something like this, I think:
```python
toolchain(
    name = "precompiler_3_10_linux_toolchain",
    target_compatible_with = [
        "@rules_python//python/config_settings:is_python_3.10",
    ],
    exec_compatible_with = ["@platforms//os:linux"],
    toolchain = ":precompiler_toolchain_impl",
    toolchain_type = "//python:precompiler_toolchain_type",
)

py_precompiler_toolchain(
    name = "precompiler_toolchain_impl",
    interpreter = ":bin/python3.10",
    precompiler_src = "@rules_python//tools/precompiler:precompiler.py",
)

py_precompiler_toolchain = rule(..., attrs = {
    "interpreter": attr.label(cfg = "exec"),
    "precompiler_src": attr.label(cfg = "exec"),
})

# In the maybe_precompile function
ctx.actions.run(
    executable = ctx.toolchains["//python:precompiler_toolchain_type"].interpreter,
    args = [ctx.toolchains["//python:precompiler_toolchain_type"].precompiler_src] + [src.path, pyc.path],
    toolchain = "//python:precompiler_toolchain_type",
    ...
)
```
(there's variations of the above that would also work, but thats the gist).
Was prototyping a bit. I really appreciate all the pointers here!
I used the system python interpreter to prototype for now and noticed some weird behavior; perhaps I'm missing something fundamental here, but when bazel "compiles" these python files into pyc, sometimes we run into the case where the sandbox only contains the directory bazel-out. cd-ing into the sandbox debug path, we see that only bazel-out exists and no "external" directory exists for external/pip_grpcio/site-packages/grpc/framework/interfaces/face/utilities.py
```
(cd /dev/shm/bazel-sandbox.2566462e83a3701bdcf2405eb4d15ec8f4c1eed3744dd6958e7c02b8ef25ffcf/linux-sandbox/153/execroot/test_pyc && \
  exec env - \
    TMPDIR=/tmp \
    /home/ryang/.cache/bazel/_bazel_ryang/install/14fb027596f626f2526df4873ea20b8b/linux-sandbox -t 15 -w /dev/shm -w /dev/shm/bazel-sandbox.2566462e83a3701bdcf2405eb4d15ec8f4c1eed3744dd6958e7c02b8ef25ffcf/linux-sandbox/153/execroot/test_pyc -w /tmp -e /tmp -S /dev/shm/bazel-sandbox.2566462e83a3701bdcf2405eb4d15ec8f4c1eed3744dd6958e7c02b8ef25ffcf/linux-sandbox/153/stats.out -D /dev/shm/bazel-sandbox.2566462e83a3701bdcf2405eb4d15ec8f4c1eed3744dd6958e7c02b8ef25ffcf/linux-sandbox/153/debug.out -- /bin/bash -c '/bin/python3.10 -m py_compile external/pip_grpcio/site-packages/grpc/framework/interfaces/face/utilities.py')
....
ls -a
.  ..  bazel-out
```
Debugging this prototype, we see src.path for this situation is external/pip_grpcio/site-packages/grpc/framework/interfaces/face/utilities.py and pyc.path is bazel-out/k8-opt/bin/external/pip_grpcio/site-packages/grpc/framework/interfaces/face/utilities.cpython-310.pyc
Is there something fundamental that I'm missing?
Prototype:
```python
def maybe_precompile(ctx, srcs):
    pycs = []
    for src in srcs:
        if "site-packages" not in src.path:
            pycs.append(src)
            continue
        basename = src.basename
        dirname = src.path.rsplit("/", 1)[0].split("/", 2)[2]
        pyc_out = dirname + "/__pycache__/" + basename.replace(".py", ".cpython-310.pyc")
        pyc = ctx.actions.declare_file(pyc_out)
        ctx.actions.run_shell(
            outputs = [pyc],
            command = "/bin/python3.10 -m py_compile " + src.path,
        )
        pycs.append(pyc)
    return pycs
```
That action invocation doesn't look correct.
- The src File object should be an input to the action. This ensures the file is available when the action runs.
- The output path should also be passed to the command. I don't see how the py_compile module would know that the output pyc file is in the same location as where bazel is going to create it.
The second should result in an error from bazel about an output file not being created.
Worth emphasizing the second point there,
- The output path should also be passed to the command. I don't see how the py_compile module would know that the output pyc file is in the same location as where bazel is going to create it.
You cannot assume that the tool can derive the output path from the input path, because for example that prefix bazel-out/k8-opt could also be bazel-out/k8-fastbuild or e.g. something like k8-opt-exec-2B5CBBC6 if there are configuration transitions happening, which there probably will be in some cases.
py_compile will just try to create it as a sibling of the source file in the input tree, or in a __pycache__ subdirectory, which will most likely fail due to that being a read-only filesystem within the sandbox. But I think it'll fail silently because python wants you to be able to use python libraries in directories you don't own.
Another point worth noting here is that if you're using a python script to generate the .pyc then you can lean on the existing python toolchain configuration to find the python executable. This doesn't necessarily obviate the need for a separate compilation toolchain, because toolchain resolution in this case will give you a python interpreter that is compatible with the execution environment, but you might need something other than that to produce output compatible with the target environment.
Prototype done!
Thanks for all the pointers! not pretty, but it does generate all the pyc files! (in our case, we use python310)
```python
def maybe_precompile(ctx, srcs):
    """Computes all the outputs (maybe precompiled) from the input srcs.

    See create_binary_semantics_struct for details about this function.

    Args:
        ctx: Rule ctx.
        srcs: List of Files; the inputs to maybe precompile.

    Returns:
        List of Files; the desired output files derived from the input sources.
    """
    pycs = []
    for src in srcs:
        if ctx.label.package != src.owner.package:
            continue
        if src.extension != "py":
            continue
        if "site-packages" not in src.path:
            pycs.append(src)
            continue
        basename = src.basename
        dirname = src.path.rsplit("/", 1)[0].split("/", 2)[2]
        pyc_out = dirname + "/__pycache__/" + basename.replace(".py", ".cpython-310.pyc")
        pyc = ctx.actions.declare_file(pyc_out)
        command = "'from py_compile import compile ; compile(\"" + src.path + "\", cfile=\"" + pyc.path + "\")'"
        ctx.actions.run_shell(
            outputs = [pyc],
            inputs = [src],
            command = "export SOURCE_DATE_EPOCH=123 && /bin/python3.10 -c " + command,
        )
        pycs.append(pyc)
        pycs.append(src)
    return pycs
```
Can clean it up a bit by doing
```python
args = ctx.actions.args()
args.add("-c")
args.add("""
from sys import argv
from py_compile import compile
compile(argv[1], cfile=argv[2])
""")
args.add(src)
args.add(pyc)
runtime = ctx.toolchains["@bazel_tools//tools/python:toolchain_type"].py3_runtime
ctx.actions.run(
    executable = runtime.interpreter,
    arguments = [args],
    env = {"SOURCE_DATE_EPOCH": "123"},
    tools = [runtime.files],
    ...
)
```
Modulo issues with cross-compiling, that at least will ensure use of the right python executable for the current build configuration, avoids the pitfalls of string interpolation in command lines, and is probably a bit more efficient for bazel to analyze.
I'd also note that setting SOURCE_DATE_EPOCH like that isn't really a great solution here; better to set invalidation_mode to PycInvalidationMode.CHECKED_HASH or UNCHECKED_HASH (see discussion above) rather than using a fake timestamp that will either always or never result in recompilation.
Yay, a working prototype!
if "site-packages" not in src.path:
What is this line for? I'm guessing it's just debug code because you're trying to isolate it to just the pip-generated files being processed?
Adam's comments are in the right direction for cleaning it up.
For an initial PR, the important parts are:
- As discussed, use the hash-based invalidation setting
- Get the executable from toolchain lookup
- Use `sys.implementation.cache_tag` to get the `cpython-xx` magic tag part; works well enough to start with (see the snippet below).
Before making it generally available we need to:
- Use a file instead of a string; that will probably work better. This also makes it easier to use a different implementation via select.
- PYTHONHASHSEED=0 should also be set when running the action to help guard against non-determinism.
- The magic tag should also come from toolchain lookup.
- Address the target vs exec config mismatch. This enables cross-building.
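For reference, the stdlib exposes both the tag and the full `__pycache__` naming logic (the printed values here are just examples and depend on the runtime):

```python
import importlib.util
import sys

print(sys.implementation.cache_tag)
# e.g. "cpython-310"
print(importlib.util.cache_from_source("pkg/mod.py"))
# e.g. "pkg/__pycache__/mod.cpython-310.pyc"
```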
To keep things simple, my inclination would be to leave out the "magic tag" altogether, since it's optional and, at least in the context of a hermetic build, it isn't very important to have (worst case, you run a different interpreter at runtime and it just rejects and ignores the .pyc). Also, if we want to support .pyc-only mode, we can't be stuffing it into a `__pycache__` subdirectory. In that respect I think ideally we'd want to be able to defer the decision of whether or not to go .pyc-only until later, e.g. a packaging action might want to be able to build a .pyc-only "release" package as well as a "debug" package that includes the .pys.
There's also the part of this where it needs to get added in to the py_library. Ideally we'd carry it through the PyInfo provider. If we want to support "pyc-only" mode on a per-library basis, one has to be a little careful here with transitive dependencies to keep track of which .py/.pyc files do/don't exist as a pair, so downstream consumers, e.g. packaging rules, could decide to
- Ship both `.py` and `.pyc` files.
- Ship `.py` files, but only `.pyc` files that don't have a corresponding `.py`.
- Ship `.pyc` files, except in cases where a `.pyc` wasn't generated for whatever reason.
There's also the question of whether we want to support "optimized" (.pyo) mode, but TBH I'm not sure if anyone actually uses that anywhere so I wouldn't worry about it for initial implementation.
leave out the "magic tag" altogether, ... worst case you ... it just rejects and ignores the .pyc
I think it's OK to leave it out for the initial version, but it's necessary in order to support multiple versions of Python in the same build. Otherwise, it'll be a build error because different actions will attempt to create the same file with different content.
if we want to support .pyc-only mode, we can't be stuffing it into a `__pycache__` subdirectory.
Good catch. This would put pyc-only builds and multi-version builds at odds. From what I'm reading, `foo.<magic>.pyc` is only read from `__pycache__` directories, but plain `foo.pyc` is still read from alongside `foo.py`. We can't know if multiple versions are going to be used, so we can't decide automatically.
The options I see here are:
- For PYC_ONLY mode, document the potential risk with multiple versions in the same build. C'est la vie.
- For PYC_ONLY mode, use unchecked hash mode and put a no-op `.py` file so `__pycache__` is used.
- Have separate PYC_ONLY_ONE_VERSION and PYC_ONLY_MULTI_VERSION settings. One could futz with the import system to make multi-version pyc-only work.
In any case, I think we can just target py+pyc initially.
In that respect I think ideally we'd want to be able to defer the decision of whether or not to go .pyc-only until later, e.g. a packaging action that might want to be able to build a .pyc-only "release" as well as a "debug" package that includes the .pys.
This is an interesting idea, but I'm not sure how it can be well accommodated. The two ways it could be done are using an aspect or a transition. An aspect is the more natural way for e.g. a higher-level rule, like a packaging one, to add extra information to its dependencies, but the issue with an aspect is that it would conflict with the action performed by py_library. It could work OK if the py_library had any pyc actions suppressed.
It might work better for such a rule to do a split transition.
Looking at how some other rules handle this might be informative (the case that comes to mind is c++; iirc, there's a combination of build flags, compiler flags, and providers that determine and provide debug information)
where it needs to get added in to the py_library. Ideally we'd carry it through the PyInfo provider.
Yes, it belongs in PyInfo.
iirc, maybe_precompile() will put it into PyInfo.transitive_srcs (it also goes into runfiles). I'm a bit mixed on this because I'm torn about whether transitive_srcs should only contain source files (.py files) or not. Things work OK with it having pyc files in there, and splitting out a separate depset just for pyc files seems overkill, I guess? I didn't find much indication that a lot of thought went into this question originally.
should we support "optimized" (.pyo) mode ... skip for initial implementation
Yeah, I agree. Skip it for the initial implementation. Note that .pyo files were dropped and that info is now part of the file name after the magic tag (the optimize arg controls it). See https://peps.python.org/pep-0488/
This would be easy to add, though: just add a second arg to set the compilation optimization level and pass it onto the action.
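For instance, a sketch using the stdlib (not the proposed rule API); the optimization level affects both the bytecode and, per PEP 488, the file name infix (printed path is illustrative):

```python
import importlib.util
import py_compile

# PEP 488 naming for an optimization level 2 pyc:
print(importlib.util.cache_from_source("mod.py", optimization=2))
# e.g. "__pycache__/mod.cpython-310.opt-2.pyc"

py_compile.compile(
    "mod.py",
    optimize=2,  # like -OO: drops asserts and docstrings
)
```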
leave out the "magic tag" altogether, ... worst case you ... it just rejects and ignores the .pyc

From what I'm reading, `foo.<magic>.pyc` is only read from `__pycache__` directories, but plain `foo.pyc` is still read from alongside `foo.py`
😅 Wondering if there are docs for when a plain foo.pyc should also be read (or perhaps I'm misinterpreting)? Doing some testing, I'm not quite sure that placing the same .py & .pyc file in the same dir will actually result in the pyc file being used.
Toy example: in a directory we have the following:

```
ls -a
.  ..  main.py  ryang.py  ryang.pyc
```

ryang.py:

```python
def hello():
    print("hi")
```

ryang.pyc: manually generated pyc via `compile`

main.py:

```python
from ryang import hello
```

If we run the following file, we see from the trace that we don't actually read from the "magic tag":

```
python3 -v main.py
Python 3.10.13 (main, Aug 25 2023, 13:20:03) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
# code object from /tmp/ryang.py
# created '/tmp/__pycache__/ryang.cpython-310.pyc'
import 'ryang' # <_frozen_importlib_external.SourceFileLoader object at 0x7f1c292c00a0>
```
What is this line for?
yeah 😓
Wondering if there's docs for when plain foo.pyc should also be read
pep-3147 has a handy flow-chart:

[flow chart image from PEP 3147 showing how the interpreter decides whether to use a cached pyc]
This is an interesting idea, but I'm not sure how it can be well accommodated.
What I had envisioned looks more or less like this:
- PyInfo maintains four sets of transitive sources:
  1. py-only sources
  2. pyc-only sources
  3. py sources where there's also a pyc
  4. pyc sources where there's also a py
- Downstream consumer rules always consume 1 and 2, but have a choice whether to take 3, 4, or both.
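A rough Starlark sketch of what those could look like as provider fields (the provider and field names here are made up for illustration; today's PyInfo has none of these):

```python
# Hypothetical fields on a future PyInfo-like provider.
PyPrecompileInfo = provider(fields = {
    "transitive_py_only": "depset of .py files with no corresponding .pyc",
    "transitive_pyc_only": "depset of .pyc files with no corresponding .py",
    "transitive_py_paired": "depset of .py files that also have a .pyc",
    "transitive_pyc_paired": "depset of .pyc files that also have a .py",
})
```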
Ah, I see - python will only import the pyc in the same directory as the py file if and only if the original py file doesn't exist.
For the consumer speedup at import time, we would only get speedups (meaning no need to generate a pyc file again) for the following
- PyInfo maintains four sets of transitive sources:
  1. py-only sources --> no speedup
  2. pyc-only sources --> speedup
  3. py sources where there's also a pyc --> no speedup
  4. pyc sources where there's also a py --> no speedup
Wondering what downstream users would use options 3/4 for?
~That's not quite true. I think it'll still use foo.pyc from the same directory as foo.py if there isn't a `__pycache__/` and it isn't able to create one, which would normally be the case in a bazel sandbox.~
Regardless, if you are using the `__pycache__` path, you should be seeing a speedup so long as you're generating it correctly - meaning, not using timestamp-based invalidation. Setting SOURCE_DATE_EPOCH will make the timestamp reproducible but will also make the timestamp always be older than the source file, so python won't ever use the .pyc. Using CHECKED_HASH, it doesn't need to parse the source file but does still need to read it, so the speedup on a modern system that is usually I/O-bound will be minimal. Though, in a lot of bazel contexts it may be putting everything on tmpfs, in which case the difference will be more noticeable. UNCHECKED_HASH should see a speedup regardless of whether the .py sources are present.
UNCHECKED_HASH should see a speedup regardless of whether the .py sources are present.
hm... maybe my testing methodology is wrong or perhaps I'm missing something?
Using the site-packages for debugging purposes, with pyc only, I see a speedup for bazel test (we're using the linux-sandbox spawn strategy), e.g. for `import torch`.
pyc only
```python
def maybe_precompile(ctx, srcs):
    """Computes all the outputs (maybe precompiled) from the input srcs.

    See create_binary_semantics_struct for details about this function.

    Args:
        ctx: Rule ctx.
        srcs: List of Files; the inputs to maybe precompile.

    Returns:
        List of Files; the desired output files derived from the input sources.
    """
    pycs = []
    for src in srcs:
        if "site-packages" not in src.path:
            pycs.append(src)
            continue
        basename = src.basename
        dirname = src.path.rsplit("/", 1)[0].split("/", 2)[2]
        # pyc_out = dirname + "/__pycache__/" + basename.replace(".py", ".cpython-310.pyc")
        pyc_out = dirname + "/" + basename.replace(".py", ".pyc")
        pyc = ctx.actions.declare_file(pyc_out)
        args = ctx.actions.args()
        args.add("-c")
        args.add("""from sys import argv;from py_compile import compile, PycInvalidationMode;compile(argv[1], cfile=argv[2], invalidation_mode=PycInvalidationMode.UNCHECKED_HASH)""")
        args.add(src)
        args.add(pyc)
        runtime = ctx.toolchains["@bazel_tools//tools/python:toolchain_type"].py3_runtime
        ctx.actions.run(
            inputs = [src],
            outputs = [pyc],
            executable = runtime.interpreter,
            arguments = [args],
            tools = [runtime.files],
        )
        # pycs.append(src)
        pycs.append(pyc)
    return pycs
```
pyc + py doesn't see a speedup. Even if I set the interpreter to use PYTHONDONTWRITEBYTECODE=x, which disables pyc creation upon import, I don't see a speedup 🤔
pyc + py
```python
def maybe_precompile(ctx, srcs):
    """Computes all the outputs (maybe precompiled) from the input srcs.

    See create_binary_semantics_struct for details about this function.

    Args:
        ctx: Rule ctx.
        srcs: List of Files; the inputs to maybe precompile.

    Returns:
        List of Files; the desired output files derived from the input sources.
    """
    pycs = []
    for src in srcs:
        if "site-packages" not in src.path:
            pycs.append(src)
            continue
        basename = src.basename
        dirname = src.path.rsplit("/", 1)[0].split("/", 2)[2]
        # pyc_out = dirname + "/__pycache__/" + basename.replace(".py", ".cpython-310.pyc")
        pyc_out = dirname + "/" + basename.replace(".py", ".pyc")
        pyc = ctx.actions.declare_file(pyc_out)
        args = ctx.actions.args()
        args.add("-c")
        args.add("""from sys import argv;from py_compile import compile, PycInvalidationMode;compile(argv[1], cfile=argv[2], invalidation_mode=PycInvalidationMode.UNCHECKED_HASH)""")
        args.add(src)
        args.add(pyc)
        runtime = ctx.toolchains["@bazel_tools//tools/python:toolchain_type"].py3_runtime
        ctx.actions.run(
            inputs = [src],
            outputs = [pyc],
            executable = runtime.interpreter,
            arguments = [args],
            tools = [runtime.files],
        )
        pycs.append(src)
        pycs.append(pyc)
    return pycs
```
😅 I was under the impression that py + pyc in the same directory would not result in speedups https://peps.python.org/pep-3147/
No, you're right, I missed that bit. So if .py is present, .pyc must be in __pycache__, and otherwise it can't be. That does certainly make it more difficult to defer the decision of pyc-only vs py+pyc.
I think we have the following paths:
- pyc + py --> we'd need to use the magic tag + `__pycache__` directory
- pyc only --> no need for magic tag, but we can only support one version of python
- py only --> default behavior
Going back to why would pyc be useful and tradeoffs of generating pyc, I think the pros / cons of each decision would be
| Situation | Pros | Cons |
|---|---|---|
| pyc + py | import speedups | complexity with magic tags |
| pyc | simple to implement | issues with multiple python versions; would probably require some sort of opt-in functionality since this might break downstream consumers |
| py | default behavior | default behavior |
Coming at this from a python rules consumer perspective, it would seem the first option of pyc + py just moves the pyc creation earlier, rather than doing it at runtime (import time).
The default behavior (unless one uses a CLI arg to disable it) is to generate pyc files in `__pycache__` upon import: https://docs.python.org/3/using/cmdline.html#cmdoption-B
Wondering what the use case for pyc only is (just curious 😅)?
There's basically three use cases for pyc-only.
- The bad one is if you're trying to keep your source code secret from a casual investigator. It won't do a great job of that, because reverse-compiling is pretty easy, but at least keeps search tools from indexing it.
- Maybe you just want to put a speed bump in the way of your customers who know just enough python to get themselves into trouble making changes to your code to "fix" things they perceive as bugs and then sending problem reports to your support team with stack traces that don't make any sense any more. Not that I'd have any experience with that sort of thing...
- The third, and probably best, use case would be to reduce the size and file count of the package you are distributing, particularly for deployment in k8s or serverless contexts. It would also probably help in a `bazel test` context, particularly with remote execution. That's the kind of use case where you might want pyc-only for an `opt` build but would want to keep the sources around for a `dbg` build.
Assuming we want to solve for the use case of
- pyc only (will need to document caveats for multi-python-version build environments)
- pyc + py (default behavior)

Would consumers have something like the following? As well as some sort of flag that can set this attribute for all py_libraries, i.e. https://github.com/bazelbuild/rules_python/blob/e86252ffd6d1a1bf32ae99933acc5ab49b78ec1e/python/private/common/attributes.bzl#L163?
```python
py_library(
    name = "<name>",
    srcs = ["src.py"],
    deps = ["<deps>"],
    pyc_only = True,  # default is pyc + py? (using magic tag + __pycache__)
)
```
If we were to do it this way, I think we'd need to make changes in a couple of areas? (haven't worked too much with starlark rules, so I may be doing something wrong here)
Would we need to
I don't think we'd want the attribute to be boolean; as discussed above I think there's probably 4 modes of interest:
- No compilation (`.py`-only)
- `.py` + `__pycache__/*.pyc` with `CHECKED_HASH` mode
- `.py` + `__pycache__/*.pyc` with `UNCHECKED_HASH` mode
- `.pyc`-only
The difference between CHECKED_HASH and UNCHECKED_HASH is probably something we'd want to set at the toolchain level, though I can imagine situations where you might want to be more selective.
I'd probably call it compile_mode or maybe pyc_mode or even just pyc.
It would probably make sense for the initial default to be 1, for backwards compatibility, though really we'd also probably want the default to be "default", meaning use whatever was set on the toolchain configuration[^1], so users can configure it globally but override locally.
So, the place to start would probably be with the toolchain configuration. And for an MVP at least, we can just leave it there and not yet expose options on a per-library basis.
We don't actually have to change PyInfo if we do it this way (non-deferred decision as to whether to compile).
However, if we allow ourselves to make changes to PyInfo, we can defer the mode decision until later on. In that case, we'd need PyInfo to keep track of depsets of transitive files for:

1. `pyc`-only distribution: `.pyc` files placed where the `.py` file would be (not in `__pycache__`), except for targets where compilation was disabled, in which case we need the `.py` files.
2. `.py`-only distribution files. This would include any `pyc` files which might have been configured for `pyc`-only mode, e.g. something the author is especially interested in obfuscating.
3. `__pycache__/*.pyc` files corresponding to `.py` files in 2. Downstream targets that wanted `py+pyc` mode would use this together with 2.
4. (optional) `__pycache__/*.pyc` for `.py` files where `pyc`-only was disabled but `py+pyc` was allowed, if we want to support that. A `pyc`-only distribution might choose to use this or not depending on their size/speed tradeoff calculation. Or we could just include this in 1 unconditionally. It probably isn't worth supporting this case, but listing it here for completeness.
A distribution could use either 1 (mostly pyc-only), 1+4 (mostly pyc-only plus compilation of whatever isn't), 2 (mostly-py-only), or 2+3 (mostly-py+pyc).
[^1]: Typically, defaults that delegate to some sort of global configuration would be using None, however a user would be justifiably confused by compile_mode = None resulting in compilation.
Yes, have a string-valued setting. The default should be "default", or "auto", or some other sentinel. This allows easier detection of "did someone set this?", plus some freedom in implementation.
who sets checked vs unchecked hash mode? other more selective cases
Yeah, this is a tricky question. For "released" code, unchecked hash is the obvious choice. When doing development, you want checked hash (or disable pyc entirely) for sources you're going to modify, and unchecked for ones you haven't.
This also makes me think that, for something like pip.parse-generated libraries, it would make sense for those to set unchecked hash (i.e. a library-level setting), because you're really unlikely to modify those (if you are, you might as well just modify the generated build file, too).
Another case that comes to mind is generated code (e.g. py_proto_library). Such code has to go through a build step to see changes anyways, so adding an extra call to precompile isn't that much overhead. (In the case of py_proto_library, because protos tend to generate a lot of files, skipping the runtime py->pyc step is a net win.)
I can imagine some extra set of flags to control behavior here from the "outside" (I'm thinking a label flag that points to a richer config), but let's keep it simple for now -- I think a library-level setting for this makes sense (with the 5 modes listed: auto, py-only, pyc-only, pyc-checked-hash, pyc-unchecked-hash).
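As a sketch of what that library-level attribute could end up looking like (names and values here are illustrative, nothing settled):

```python
# Hypothetical py_library attribute; "auto" is the sentinel meaning
# "defer to the toolchain / global flag default".
"precompile": attr.string(
    default = "auto",
    values = [
        "auto",
        "py_only",
        "pyc_only",
        "pyc_checked_hash",
        "pyc_unchecked_hash",
    ],
),
```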