
Allow root module to customize toolchains

Open rickeylev opened this issue 1 year ago • 6 comments

We've had several requests where users want to modify something about the toolchain definitions that are generated for the hermetic runtimes. Typically these are small tweaks. Allowing other modules to tweak toolchains is a no-go, but allowing the root module to change things is reasonable.

The "simple" option for customizing toolchains is to add args to python_register_toolchains and python.toolchain.

The two main issues I see with adding args are:

  1. bzlmod APIs tend to be "written in stone". They happen very early in the build process, so we have limited options for evolving the API. Here, that is somewhat tempered by only the root module being allowed to use it, but it still makes me uneasy.
  2. Toolchains tend to have unstable APIs. Most parts of them are just implementation details for the rules. For example, adding a single arg to customize the bootstrap template made sense last year, but now there are multiple template files. Similarly, the stub_shebang attribute is slated for removal with the introduction of --bootstrap_impl=script.

Adding args to python.toolchain is, I think, a bad idea. Having something else, e.g. python.root_config might make a bit more sense. That said, args on python.toolchain() make it easy to associate something to a particular toolchain, while python.root_config would need some arg to specify which toolchain it wanted to restrict the setting to.
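To make the tradeoff concrete, a hypothetical sketch of the two API shapes. Neither python.root_config nor the bootstrap_template attribute exists; the names are invented purely for illustration:

```starlark
# MODULE.bazel -- hypothetical sketch only; these attributes do not exist.
python = use_extension("@rules_python//python/extensions:python.bzl", "python")

# Shape A: args directly on the toolchain tag. Easy to associate a setting
# with a particular toolchain, but hard to evolve later.
python.toolchain(
    python_version = "3.12",
    bootstrap_template = "//tools:my_bootstrap.txt",  # hypothetical arg
)

# Shape B: a separate root-module-only tag, which needs an extra arg to say
# which toolchain the setting applies to.
python.root_config(
    python_version = "3.12",  # hypothetical selector
    bootstrap_template = "//tools:my_bootstrap.txt",
)
```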

This issue is to collect the use cases and try to figure out some options.

Things people want to change:

  • [ ] Overriding bootstrap template

    • Example: https://github.com/bazelbuild/rules_python/pull/2032
      • Goal: They want to implement a separate bootstrap
      • The PR proposes adding args to e.g. python_register_toolchains().
      • Issues: (1) Would need to be exposed via bzlmod APIs. (2) There are 4 bootstrap files: python_bootstrap_template (legacy system-python bootstrap), stage1 script bootstrap, stage2 bootstrap, and zip main bootstrap. Plus the launcher exe thing for Windows.
      • Proposal: Indirect the references to the templates using flags.
  • [ ] Customizing stub_shebang: https://bazelbuild.slack.com/archives/CA306CEV6/p1721464127008299

    • Goal: They want to set -S PYTHONNOUSERSITE=1 for all their Python invocations
    • Proposal 1: Have a flag override for stub_shebang
    • [x] Proposal 2: Add interpreter_args and make env populate values into the bootstrap.
    • [ ] Proposal 3: Have flags for "default interpreter args" and "default interpreter env"
  • [x] The register_coverage_tool arg

    • Goal: Having coverage included as part of the toolchain is unnecessary if the user's test provides the coverage library itself
  • [ ] Control whether local or hermetic runtimes are used

    • Goal: Provide a way for the root module to decide whether a local or hermetic toolchain is used.
    • Basically a local=True|False arg when defining a toolchain. It decides whether the hermetic runtime repo rule or local runtime repo rule is used under the hood.
    • Proposal: Have a flag that acts as a constraint for matching toolchains.
  • [ ] Control whether runtimes are in-build or platform runtimes

    • Goal: A platform runtime is cheaper to set up; it isn't compatible with RBE or sandboxing, but that's fine if you aren't using them.
    • This mostly applies to local runtimes, but could also apply to hermetic.
    • Basically, have an inbuild=True|False arg when defining a toolchain. It controls whether the underlying py_runtime() targets use interpreter= or interpreter_path=.
  • [x] Control constraints of toolchains

    • Goal: Allow registering both local and hermetic runtimes. Using constraints, local or platform toolchains can be used for local invocations, and hermetic for RBE.
    • [x] local toolchains can have arbitrary constraints set
    • [x] regular (python-build-standalone) can have their constraints customized
  • [x] Control what toolchain versions are available

    • Goal: An infrastructure team wants to ensure only e.g. Python 3.11 (or a small set of approved versions) is usable at their company.
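As a concrete illustration of that last use case: the root module is the only module whose toolchain registrations take priority, so it can pin the available versions simply by being explicit about what it registers. A minimal MODULE.bazel sketch, assuming a standard rules_python bzlmod setup:

```starlark
# MODULE.bazel of the root module: register only the approved version.
python = use_extension("@rules_python//python/extensions:python.bzl", "python")
python.toolchain(
    python_version = "3.11",
    is_default = True,
)
```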

rickeylev avatar Jul 20 '24 15:07 rickeylev

Wanted to create a separate issue for this but realized that this is already there. We could just create a python.override tag class for the majority of the things listed above, but to begin with we could customise the URLs used for the Python toolchains.

Currently the TOOL_VERSIONS used by the extension cannot be overridden via MODULE.bazel. The only way users have right now is to use the bazel_downloader config, but that may give hard-to-debug errors and still does not allow users to add Python builds for extra platforms that are not present on the indygreg website.

Initial solution could be a python.override tag class that can override the URLs/structs for python toolchains that are used. The design of the API should allow for future extension for different overrides:

  • restricting allowed python versions
  • setting the x.y->x.y.z mapping
  • disable registering all versions
  • etc
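A sketch of what such a python.override tag class could look like. The attribute names here are illustrative only and should be checked against whatever API actually ships:

```starlark
# MODULE.bazel -- illustrative sketch of the proposed override tag class;
# attribute names are not guaranteed to match the final API.
python = use_extension("@rules_python//python/extensions:python.bzl", "python")
python.override(
    available_python_versions = ["3.11", "3.12"],  # restrict allowed versions
    minor_mapping = {"3.11": "3.11.9"},            # x.y -> x.y.z mapping
    register_all_versions = False,                 # opt out of registering everything
)
```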

aignas avatar Aug 08 '24 06:08 aignas

I'm taking a stab at customizing the target_compatible_with and target_settings parts of a toolchain. The API I'm looking at adds args to python.single_version_platform_override, so one can do e.g.

python.single_version_platform_override(
    platform = "linux_x86_64-debug",
    url = ".../cpython-3.12.0-debug-full.tar.gz",
    python_version = "3.12.0",
    target_compatible_with = ["@platforms//os:linux", ...],
    target_settings = ["//:is_debug_enabled"],
)

The net effect being: you can pick any of the python-build-standalone URLs and make it available with whatever custom flags you desire. Similarly, you can use any URL for an archive that extracts out to something structurally compatible with what python-build-standalone produces.

rickeylev avatar May 12 '25 22:05 rickeylev

Ideally it would be great to start using the MODULE.bazel configuration as the source of truth rather than the versions.bzl file. Maybe we could have the versions.bzl PLATFORMS constant updated automatically based on what we define in the MODULE.bazel?

Though I am not sure where this belongs. The more I think about it, the more I realize that the target python platforms that we want to define have to be somewhat similar/related between the pip and python extensions. What do you think?

aignas avatar May 14 '25 04:05 aignas

MODULE.bazel as source of truth

Yes. I'll take it a step further: I like the idea of fetching a remote config of all available toolchains (i.e. all ~500 python-build-standalone runtimes). But, by default, we only register one toolchain -- what works on the host. If people want more than that, they have to opt in somehow (e.g. MODULE.bazel changes). Similarly, ideally, waving the magic wand 🧙 some more, the set of toolchains built into the rules_python source code is only the most recent supported Python versions for popular platforms. Anything beyond that requires the remote manifest or user opt-in.

But, yeah, quite a bit of refactoring and work needed to get to that point. In the meantime, the upcoming PR to let single_version_platform_override add mostly-arbitrary PBS runtimes will go a long ways.

PLATFORMS is updated based on what is defined

Yeah, this is pretty close to my followup PR from the recent refactor PR to python.bzl. The design of "there's a dict of platform->settings; a toolchain has a platform key, which is looked up in that dict" is baked pretty deeply into various places. My upcoming PR adds a second platforms dict that single_version_platform_override adds into, the two are merged, and that merged result gets passed down to functions that need the platform dict (instead of using the PLATFORMS constant).

Ideally, I don't think we should be defining these platform-description-strings. The values we have today mostly just derive from the python-build-standalone file names and the need to have a unique, valid, repo name for the downloaded runtime.

the toolchain platforms have to be similar to the pip platforms

How so? Do you mean there is some code in the pip-integration that looks at the PLATFORMS global to do something-or-other? Ah, maybe the part where pip.parse tries to automatically lookup the correct interpreter? The need for that should go away with pipstar, right? It'll only be needed for sdist building -- and if we move that to build phase, then that need will go away (and we're only left with the problem of setup_requires et al for sdist building)

rickeylev avatar May 14 '25 17:05 rickeylev

Yes. I'll take it a step further: I like the idea of fetching a remote config of all available toolchains (i.e. all ~500 python-build-standalone runtimes). But, by default, we only register one toolchain -- what works on the host. If people want more than that, they have to opt in somehow (e.g. MODULE.bazel changes). Similarly, ideally, waving the magic wand 🧙 some more, the set of toolchains built into the rules_python source code is only the most recent supported Python versions for popular platforms. Anything beyond that requires the remote manifest or user opt-in.

But, yeah, quite a bit of refactoring and work needed to get to that point. In the meantime, the upcoming PR to let single_version_platform_override add mostly-arbitrary PBS runtimes will go a long ways.

Yeah this is a big one. I have the following concerns about the feasibility:

  • If transitive modules can add configuration it will be a mess.
  • If the root module has to provide configuration for transitive modules, it will be a mess as well.
  • Providing toolchains from a single python-build-standalone release is probably an easy to understand compromise.
  • Doing it only for a particular host: I like the idea in theory. If you need extra platforms, you need to do more, but in practice adding only the host platform has to happen in the repository_ctx and that is not great. It is much easier to just add everything for a particular version and use target_settings to discriminate.

The values we have today mostly just derive from the python-build-standalone file names

Yeah, this riffs on what you are saying above. Maybe we could split the hermetic Python toolchain into a separate module, where people can easily register stuff? If the pip extension did not depend on Python, that would actually be a feasible architecture.

How so? Do you mean there is some code in the pip-integration that looks at the PLATFORMS global to do something-or-other?

My thinking was that the platform description (the thing that you said you would like to leave out) is right now done in three places: the uv extension, the python extension, and pip. Within pip, with pipstar, it has to be done in three places: (1) the whl_library marker_setting, which selects the deps based on the target configuration; (2) where we evaluate marker expressions in the requirements files (currently we use a Python interpreter, but we could ask the user to provide a target platform configuration if they provide a requirements/lock file with markers); (3) where the wheel filenames are mapped to config settings: we depend on the fact that manylinux_x86_64 maps to os:linux, cpu:x86_64, and a few extra flags. We don't depend on the platform definitions, but in order to filter out wheels that the user does not want/need, we have to assume the target platform triples (i.e., cp39_linux_x86_64 and similar).

This is why I was thinking that the target platform concept would be nice, because that would allow us to include/create only the actually required whl_library instances. That is probably where I would love to know the incantation that I need to say with the magic wand here.

aignas avatar May 15 '25 07:05 aignas

Update to this feature: as of v1.5, python.single_version_platform_override allows registering arbitrary python-build-standalone (or compatible) URLs for arbitrary platforms/flag combinations. The API name is a bit misleading: it can override or add runtimes.
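A hedged sketch of that usage, modeled on the snippet earlier in the thread. The URL and sha256 are placeholders, and the exact attribute names (e.g. url vs urls) should be verified against the v1.5 rules_python docs:

```starlark
# MODULE.bazel -- sketch; placeholder URL/sha256, verify attribute names.
python = use_extension("@rules_python//python/extensions:python.bzl", "python")
python.single_version_platform_override(
    python_version = "3.12.0",
    platform = "linux_x86_64-debug",  # custom platform key
    urls = ["https://example.com/cpython-3.12.0-debug-full.tar.gz"],  # placeholder
    sha256 = "0000000000000000000000000000000000000000000000000000000000000000",  # placeholder
    target_compatible_with = [
        "@platforms//os:linux",
        "@platforms//cpu:x86_64",
    ],
    target_settings = ["//:is_debug_enabled"],
)
```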

rickeylev avatar Jun 13 '25 20:06 rickeylev