Allow using arbitrary wheel files with arbitrary constraints in pypi integration
Sometimes you want to use a custom wheel that can't be represented by Python packaging constraints, or simply want to use a particular whl at a particular location for your own testing, development, or other reasons.
The particular case I have in mind is PyTorch and building PyTorch from source. Doing this presents some problems:
There are (at least) 5 different distributions of PyTorch for different accelerators (CUDA 11.8, CUDA 12.6, CUDA 12.8, ROCm 6.3, and CPU). Unfortunately, environment markers can't represent these conditions, so it's not possible to express which of these "torch" should resolve to in a requirements or pylock file.
The above have public URLs, but using a local file is also desirable in some cases.
- In JAX: some tests build a wheel (using Bazel), then use that wheel in other tests (also run by Bazel).
- In PyTorch XLA: they build PyTorch manually, then use that for the torch dependency.
- I've seen various Slack posts of people building torch (or other ML, C++-heavy things) manually.
Local files would also be helpful for our own testing -- we could generate exactly what we needed without incurring the overhead of remote fetching.
To make this work, we basically need to mix in additional settings to the hub's routing aliases. Ultimately, we want to generate something like this in the hub:
```starlark
# File: @pypi//torch:BUILD.bazel
alias(
    name = "torch",
    actual = select({
        "@user//:is_torch_1": "@pypi_torch_cuda_11.8//:pkg",
        "@user//:is_torch_2": "@pypi_torch_cpu//:pkg",
        "//conditions:default": ":_default",
    }),
)

alias(
    name = "_default",
    actual = <select that is generated today>,
)
```
I'm not entirely sure how to end up there, though. I'm thinking of a `pip.override()` API that takes the conditions and their destinations:
```starlark
pip.parse(
    hub_name = "my_pypi",
    requirements = "requirements.txt",
)

pip.override(
    hub_name = "my_pypi",
    package = "torch",
    config_setting = ["@user//:is_torch_cuda_11.8"],
    urls = ["https://torch.com/torch-cuda-11.8.whl"],
)

pip.override(
    hub_name = "my_pypi",
    package = "torch",
    config_setting = ["@user//:is_torch_cpu"],
    wheel = "@user//:torch-cpu.whl",
)
```
Under the hood, each wheel/url value turns into a repo, generating a `whl_library`-compatible repo (i.e. one that downloads and extracts the wheel). The `config_setting` values are fed into whatever generates the hub's `select()` routing.
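As a minimal sketch of that repo-creation step (the rule name, attributes, and generated BUILD content are all hypothetical, not an existing rules_python API):

```starlark
# Hypothetical repo rule: fetch one wheel and expose it in a
# whl_library-compatible shape. A .whl is a zip archive, so
# download_and_extract can unpack it directly.
def _whl_from_url_impl(rctx):
    rctx.download_and_extract(
        url = rctx.attr.urls,
        output = "site-packages",
        type = "zip",
    )
    rctx.file("BUILD.bazel", """\
load("@rules_python//python:defs.bzl", "py_library")

py_library(
    name = "pkg",
    srcs = glob(["site-packages/**/*.py"]),
    data = glob(
        ["site-packages/**/*"],
        exclude = ["site-packages/**/*.py"],
    ),
    imports = ["site-packages"],
    visibility = ["//visibility:public"],
)
""")

whl_from_url = repository_rule(
    implementation = _whl_from_url_impl,
    attrs = {"urls": attr.string_list(mandatory = True)},
)
```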
Alternative: don't do the repo creation part. Just plumb through the config condition and the repo name. Forcing users to create the repo doesn't feel ideal; we'd probably want to provide some sort of helper for that (but not `whl_library` directly -- its API is full of internal details).
There are two other pieces of the system where I'm not sure how the interaction will work:
(1) `experimental_index_url`. IIUC, this works by traversing the Simple API graph to find a whl that satisfies the requirement. If we are providing our own wheels separately, how do those fit into the process?
For example, maybe the simpleapi doesn't find a compatible wheel, but that's expected because we're providing our own wheel?
Or: if we know we're going to use our own wheel, then traversing through the index (for that package) is wasted effort.
(2) To support conditions that Python packaging can't express, an idea we had was having multiple requirements.txt files with a `select()` layer that chooses between them, e.g.
```starlark
pip.parse(
    hub_name = "bla",
    requirements = "requirements-cpu.txt",
    condition = "@//:is_accelerator_cpu",
)
pip.parse(
    hub_name = "bla",
    requirements = "requirements-cuda.txt",
    condition = "@//:is_accelerator_cuda",
)
```
and `@bla//somepkg` routes to `@bla_X_somepkg` or `@bla_Y_somepkg`.
Which looks pretty similar to my proposal above, just a different level of granularity.
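For concreteness, the hub routing that falls out of that could look something like this (the repo names are illustrative):

```starlark
# Hypothetical @bla//somepkg:BUILD.bazel when two conditioned
# pip.parse calls feed one hub.
alias(
    name = "somepkg",
    actual = select({
        "@//:is_accelerator_cpu": "@bla_cpu_somepkg//:pkg",
        "@//:is_accelerator_cuda": "@bla_cuda_somepkg//:pkg",
    }),
)
```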
cc @aignas
> Under the hood, each wheel/url value turns into a repo, generating a `whl_library`-compatible repo (i.e. one that downloads and extracts the wheel). The `config_setting` values are fed into whatever generates the hub's `select()` routing.
This can work pretty well. The `hub_repository` now accepts a dict where we serialize the config settings users pass to JSON. The part we would have to change right now is to respect the user's `config_setting` if that is not done already (I can't remember). Also, ensuring that we pass `Label`s correctly would need some care.
> Alternative: don't do the repo creation part. Just plumb through the config condition and the repo name. Forcing users to create the repo doesn't feel ideal; we'd probably want to provide some sort of helper for that (but not `whl_library` directly -- its API is full of internal details).
This won't work because of bzlmod visibility: rules_python has no visibility into the repos users create.
> (1) `experimental_index_url`. IIUC, this works by traversing the Simple API graph to find a whl that satisfies the requirement. If we are providing our own wheels separately, how do those fit into the process?
Depends on what we ultimately want to do. Do we just want to add things or replace things?
- Add things: ignore `experimental_index_url`, just add the `whl_library` repos and add the `whl_config_setting` to be passed to the hub.
- Replace things: add the whl name into a `skip` attribute for the `simpleapi_download`. It should exclude the package, forcing the user to do the proper `pip.override` for all of the wheels they are interested in.
> (2) To support conditions that Python packaging can't express, an idea we had was having multiple requirements.txt files with a `select()` layer that chooses between them, e.g.
This would be #2548. I hope that once I have time to refactor the API, this can become easier. Again, the main work is ensuring that the way `whl_config_setting` is serialized/deserialized works with `Label`s (i.e. use `str(Label(x))` before serializing) and that we are creating the right config settings, combining what the user passes with what we already need to apply (a sketch follows the list):
- `python_version`
- platform (os, arch)
- the config setting derived from the whl filename, if the whl filename is known.
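A minimal sketch of the serialization concern, assuming a `whl_config_setting`-like struct (the field names are illustrative, not the actual rules_python internals):

```starlark
# Hypothetical: canonicalize labels with str(Label(x)) before JSON
# round-tripping, so apparent repo names resolve the same way on the
# hub side regardless of which module wrote them.
def _serialize_whl_config_settings(settings):
    return json.encode([
        {
            "config_setting": str(Label(s.config_setting)),
            "python_version": s.python_version,
            "os_arch": s.os_arch,
        }
        for s in settings
    ])
```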
> There are (at least) 5 different distributions of PyTorch for different accelerators (CUDA 11.8, CUDA 12.6, CUDA 12.8, ROCm 6.3, and CPU). Unfortunately, environment markers can't represent these conditions, so it's not possible to express which of these "torch" should resolve to in a requirements or pylock file.
>
> The above have public URLs, but using a local file is also desirable in some cases.
Direct URL references (PEP 440) in lockfiles can support URL-based distributions (wheel and sdist). They also support `file:` and VCS URL schemes.
But yes, not in the same lockfile for the same environment, and there's no standard for providing overrides on the installation side (Bazel). Substituting Bazel targets would also be a new capability.
Naturally, any time a package is forcibly swapped into a locked environment, it's an expert setting that could create invalid environments.
> This won't work because of bzlmod visibility: rules_python has no visibility into the repos users create.
Ah, right, that might be an issue. A label could be used, but that'd cause an eager fetch (IIRC).
It might still work, though, because a module has visibility into the repos of the module that hosts it. I.e. if the main module uses rules_python's `pip` extension, then rules_python's pip repos have visibility into whatever the main module defines. Doc: https://bazel.build/external/migration#repository-visibility. I think this is how the visibility for user-defined config settings works for the custom toolchain stuff. Not sure if it applies here (since we're talking about repos, not target labels).
In any case, `inject_repo` would be a workaround (though a root-module-only option).
> Depends on what we want to do: add or replace?
Hm, good question.
In the case of torch, the different wheel implementations have different dependencies (e.g. torch for CUDA depends on NVIDIA CUDA packages, while torch for CPU doesn't). So, actually, for this to be correct, there would have to be separately resolved requirements. You can't simply replace (unless you know that what you're replacing it with has the same dependencies).
If you add an arbitrary wheel, then you may have missing deps, too. So either the user accepts that gap, solves it themselves (manually adding to requirements.txt somehow), or we solve it (unpack the wheel and follow `Requires-Dist`, or add a bzlmod API to specify the deps, etc).
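For example, the "bzlmod API to specify them" variant could extend the earlier `pip.override` sketch (the `deps` attribute and the package names are hypothetical):

```starlark
pip.override(
    hub_name = "my_pypi",
    package = "torch",
    config_setting = ["@user//:is_torch_cuda_11.8"],
    urls = ["https://torch.com/torch-cuda-11.8.whl"],
    # Hypothetical: declare the override wheel's deps explicitly,
    # since they differ from what the lockfile resolved.
    deps = [
        "nvidia-cuda-runtime-cu11",
        "nvidia-cudnn-cu11",
    ],
)
```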
> Direct URL references (PEP 440) in lockfiles can support URL-based distributions (wheel and sdist). They also support `file:` and VCS URL schemes.
Yeah, that's actually the hack I'm seeing right now: a custom repo rule preprocesses a requirements.txt, adds `torch@/some/local/path.whl`, and then that is what gets passed to `pip_parse`. Sort of clever, but rather hacky IMHO.
> The `hub_repository` now accepts a dict where we serialize the config settings users pass to JSON.
This sounds promising, then. I haven't looked at the impl, but I'm thinking of having it do something similar to what I did with the Python custom toolchain stuff: have it collect a `list[tuple[str package, str config_setting, str whl_url]]`, then create a repo for each entry and feed it into the hub.
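A rough extension-side sketch of that loop (reusing the hypothetical `whl_from_url` rule from earlier; the hand-off to the hub is also just assumed):

```starlark
# Hypothetical: one repo per override entry, then the
# package -> (config_setting -> repo) mapping is serialized and
# handed to the hub for select() routing.
overrides = [
    ("torch", "@user//:is_torch_cuda_11.8", "https://torch.com/torch-cuda-11.8.whl"),
    ("torch", "@user//:is_torch_cpu", "https://torch.com/torch-cpu.whl"),
]
whl_overrides = {}
for i in range(len(overrides)):
    package, config_setting, whl_url = overrides[i]
    repo_name = "pypi_override_{}_{}".format(package, i)
    whl_from_url(name = repo_name, urls = [whl_url])
    whl_overrides.setdefault(package, {})[config_setting] = repo_name
# whl_overrides is then json.encode()-ed and passed to the hub repo.
```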
> Yeah, that's actually the hack I'm seeing right now: a custom repo rule preprocesses a requirements.txt, adds `torch@/some/local/path.whl`, and then that is what gets passed to `pip_parse`. Sort of clever, but rather hacky IMHO.
One option could be to have a PEP 517 backend for Bazel, then lock against wheels in bazel-out/. That would record the wheel in pylock.toml.
I guess at install time for the overall build, something would need to ensure the in-tree builds had happened before all the other hub and py_library machinery we run. It might not be possible, though.
It might seem hacky, but as you already picked up, if you're generating a lockfile and then swapping out nodes in the graph, you are at risk of breaking things because the metadata you locked against and the metadata you install are different.
If you're trying to "vendor" a dependency in-tree into a lockfile and ensure the lock is valid in terms of metadata and constraints, then I don't think there's any other way to guarantee correctness outside locking against what you actually install.
But of course, if you don't care about correctness, then it doesn't matter and you can just swap things out.
I spent some time thinking through different ways to provide alternative sources for external dependencies. Incoming long post.
I came up with 4 different ideas for customizing how a package's stuff ends up in a consumer:
1. Target level, e.g. setting the target that `@pypi//foo:foo` points to.
2. Wheel file source, i.e. setting the URL (or file path) that `whl_library()` uses.
3. Wheel repo, e.g. setting the repo name that `@pypi//foo:foo` points to.
4. BUILD macro, e.g. setting the path in the `load(<path>, "whl_library_targets")` line in the pypi-generated `@pypi_foo//:BUILD.bazel` file.
Also, I'm assuming the sort of thing Greg says -- this doesn't handle transitive closure changes. Not sure how to wire that in, but I think that's a separate enough problem.
Overall, I think (1) is the best option because it's the most flexible and powerful: it intercepts things at the consumer's public-API side. By controlling where those targets point, you can route to any Bazel-ism you want. On the pip bzlmod extension side, the API is pretty small: `dict[package, dict[name, replacement_target]]`, i.e. `{"foo": {"pkg": "@my//:foo"}}` means "route `@pypi//foo:pkg` to `@my//:foo`".
If the user wants custom config conditions, they can put them on the replacement target. Similarly, if they have some custom repo rule doing whatever, they can point it there, too. Or if they need to intercept and add shims, munge files, etc etc, they can do so.
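Shape-wise, the extension API could look like this (`pip.override_targets` is a made-up name for this sketch):

```starlark
# Hypothetical target-level override from option (1): route
# @my_pypi//foo:pkg to a target the user fully controls.
pip.override_targets(
    hub_name = "my_pypi",
    overrides = {
        "foo": {"pkg": "@my//:foo"},
    },
)
```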
The one complication I see is that we probably want a way to route back to the original pypi targets, i.e. to allow:
```starlark
pip.override("@pypi//foo:pkg", "@my//:foo")

# my/BUILD
alias(
    name = "foo",
    actual = select({
        ":is_use_custom_foo_enabled": "@//my:custom_foo",
        "//conditions:default": "@pypi//foo:original_foo",
    }),
)
```
Which should be easy. Just create some extra targets in the hub for that.
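E.g. something like this in the generated package (the target name and wiring are illustrative):

```starlark
# Hypothetical extra target in the generated @pypi//foo package that
# preserves the original routing next to the user's override.
alias(
    name = "original_foo",
    actual = ":_default",  # the select() that is generated today
)
```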
(2) is really user-friendly because it addresses a very common case, i.e. the work to make (2) doable is worthwhile. Because Python packaging can't represent e.g. accelerators, library versions, etc., it's common for those variants to live at some other URL.
Maybe wheelnext.dev will solve this eventually, but in the meantime we need a solution.
(2) also requires the pip bzlmod extension to ask for some more info up front. Under the hood, it has to keep a list of `(str distribution, list wheel_urls/paths, list config_settings)` to remap things. This is comparable to the mapping (1) needs.
For (3), the idea is the user would define their own repo somehow (e.g. some stripped-down version of `whl_library`) that generates a set of targets in the expected layout. But (3) is just a special case of (1): it has the limitation that a repo has to have a specific layout, which prevents e.g. mapping a target into the main repo directly. Creating a repo that maps where you want is a workaround, but that's just (1) with more work.
(4) seems too low-level. While we made the `whl_library_targets` function amenable to being replaced, it has a large API and is pretty brittle. It does have the advantage of controlling the entirety of what the generated BUILD file looks like, though. It's also somewhat closely coupled to `whl_library`, which assumes a wheel is being extracted; that makes alternatives that don't come through the wheel transform harder.
I would be +1 for (2) because it is almost doable with what we have today. For the mapping, the `(distribution, list wheel_urls/paths, list config_settings)` should work well, because we have `requirements_by_platform`, where user-friendly platform strings map to a list of whls.
- Here we could ask the user to provide the wheel for a user platform string, and that can translate to the `config_setting`.
- We could also ask the user to provide a label where the name of the label is the name of the file (see the sketch below). This is basically the same as what https://bazel.build/rules/lib/builtins/repository_ctx#path mandates.
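For instance (the label and whl filename are purely illustrative):

```starlark
pip.override(
    hub_name = "my_pypi",
    package = "torch",
    # No explicit config_setting: the whl filename in the label can be
    # parsed (cp311, manylinux, x86_64) to derive the filename-based
    # config setting mentioned above.
    whl = "@user//wheels:torch-2.6.0-cp311-cp311-manylinux_2_28_x86_64.whl",
)
```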
As part of #2747 I want to revamp the whl selection algorithm to simplify it; with that done, the `config_setting` generation from the user-friendly platform string should work well.