Need a method to build native libraries for pip imports
This is probably a long-term / aspirational wish, though I'd be happy to be shown otherwise.
Packages installed via pip often depend on native libraries. As a random example, https://pypi.python.org/pypi/cairocffi is a set of Python bindings around the native cairo library. The bindings themselves don't include the native library, but they do require it to be installed on the system in a way which can be found by setuptools when compiling native code to build the relevant wheel.
It seems like requiring a system-installed library (which could change between builds) would violate hermeticity. We should instead provide a way to specify the dependent libraries required for a given pip import as Bazel targets, which can themselves be built.
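To make the wish concrete, here is a purely hypothetical sketch of what such an interface could look like; no native_deps attribute exists in rules_python, and the @cairo repo is made up:

    # Purely hypothetical; pip_import has no native_deps attribute today.
    pip_import(
        name = "pip_deps",
        requirements = "//:requirements.txt",
        # Map a pip package to Bazel targets that provide its native
        # libraries, so the wheel build is hermetic rather than depending
        # on a system-installed copy.
        native_deps = {
            "cairocffi": ["@cairo//:cairo"],
        },
    )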
This is likely related to bazelbuild/bazel#1475
https://github.com/bazelbuild/rules_python/pull/61
Some discussion here: https://groups.google.com/forum/#!topic/bazel-sig-python/ESnkuJ4udkc
Another scenario I'm struggling with: building *.so's for PyTorch. PyTorch installed through rules_python works just fine. I can also compile programs against libtorch (installed using a workspace declaration) and they work OK. What doesn't work is compiling PyTorch extensions and then using them directly with the Bazel-installed PyTorch: you get all sorts of undefined symbol errors. If I use libtorch as a dep for the extension (which works fine for C++ code), it seems unable to find the libs of transitive deps unless I provide LD_LIBRARY_PATH separately. I did verify that the libs themselves are where you'd expect them, and ldd shows nothing as "missing". If I set things up such that the libs I'd normally get from libtorch are taken directly from the PyTorch site-packages/torch/lib, things work OK.
So I need a way to get at both the headers and *.so's that are bundled inside the PyTorch wheel installed by Bazel. Currently, to the best of my knowledge, these are not exposed. There should be an option to expose them though.
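For what it's worth, you can at least inspect what the extracted wheel repository contains on disk; the repo name pypi_torch below is an assumption, substitute whatever rules_python named your torch repo:

    # Peek inside the Bazel-managed torch repository (repo name assumed).
    ls "$(bazel info output_base)/external/pypi_torch/site-packages/torch/lib"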
@1e100 I ran into the same issue. Did you find a solution?
I'm going to patch rules_python to optionally expose the include and lib subdirectories, globbing all *.so's in the latter case and all headers in the former. It'll be only for PyTorch though, super ugly. I'm not sure what a general-purpose solution would look like here. Normally you don't want to spill the guts of a lib like that; you want to expose the bare minimum necessary.
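For reference, the patch I have in mind would amount to overlaying a rule like the following into the torch repo's BUILD file; a minimal sketch, assuming the usual site-packages layout and an illustrative target name:

    # Sketch of the overlay: expose all bundled headers and shared objects.
    cc_library(
        name = "torch_headers_and_libs",  # name is illustrative
        srcs = glob(["site-packages/torch/lib/*.so"]),
        hdrs = glob(["site-packages/torch/include/**/*.h"]),
        includes = ["site-packages/torch/include"],
        visibility = ["//visibility:public"],
    )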
@yijianli-autra, after some amount of hacking I was able to accomplish what I wanted without patching, in part because rules_python supports injecting build rules into pip packages via the annotations property on pip_parse. Basically, this option lets you inject whatever rules you like into whatever packages you like, and get at their internal files, headers, and *.so's. Here's what I did:
- In WORKSPACE, add something like the following to your pip_parse declaration:
    # load added for completeness; package_annotation and pip_parse both
    # come from rules_python's pip.bzl.
    load("@rules_python//python:pip.bzl", "package_annotation", "pip_parse")

    PIP_ANNOTATIONS = {
        "torch": package_annotation(
            additive_build_content = """\
    cc_library(
        name = "libtorch",
        hdrs = glob([
            "site-packages/torch/include/torch/**/*.h",
            "site-packages/torch/include/c10/**/*.h",
            "site-packages/torch/include/ATen/**/*.h",
            "site-packages/torch/include/caffe2/**/*.h",
        ]),
        srcs = [
            "site-packages/torch/lib/libc10.so",
            "site-packages/torch/lib/libtorch.so",
            "site-packages/torch/lib/libtorch_cpu.so",
            "site-packages/torch/lib/libcaffe2_nvrtc.so",
        ],
        includes = [
            "site-packages/torch/include",
            "site-packages/torch/include/torch/csrc/api/include",
        ],
        deps = [
            "@pybind11",
            "@local_config_cuda//cuda",
        ],
        linkopts = ["-ltorch", "-ltorch_cpu", "-lc10", "-lcaffe2_nvrtc"],
        copts = ["-D_GLIBCXX_USE_CXX11_ABI=0"],
    )
    """,
        ),
    }

    pip_parse(
        name = "pypi",
        annotations = PIP_ANNOTATIONS,
        requirements_lock = "//:requirements.txt",
    )

    load("@pypi//:requirements.bzl", "install_deps")

    install_deps()
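One note on naming: with pip_parse(name = "pypi"), rules_python generates a per-package repository named pypi_torch, so you can sanity-check that the injected rule actually landed (repo naming assumed from the pip_parse name above):

    # Verify the injected target exists in the generated torch repo.
    bazel query --output=build @pypi_torch//:libtorch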
Some of the stuff in the rule may be unnecessary; this is the result of a lot of stumbling in the dark and trying to compile. CUDA is provided by TensorFlow's workspace rules, so hypothetically, with some elbow grease, this might work for ROCm (AMD) as well.
- In your library BUILD rule:
    cc_library(
        name = "cpu",
        srcs = [
            "dcn_v2_cpu.cpp",
            "dcn_v2_im2col_cpu.cpp",
            "dcn_v2_psroi_pooling_cpu.cpp",
        ],
        hdrs = [
            "dcn_v2_im2col_cpu.h",
            "vision.h",
        ],
        deps = ["@pypi_torch//:libtorch"],
        copts = ["-D_GLIBCXX_USE_CXX11_ABI=0"],
    )
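A note on the copts: -D_GLIBCXX_USE_CXX11_ABI=0 has to match the ABI the torch wheel was built with. A quick way to check what your installed wheel uses:

    import torch

    # False means the wheel was built with the old ABI, i.e. you need
    # -D_GLIBCXX_USE_CXX11_ABI=0 in copts; True means you should drop it.
    print(torch.compiled_with_cxx11_abi())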
- *.so's are binaries, not libraries, so it'll need a cc_binary rule as well:
    cc_binary(
        name = "dcn_v2.so",
        srcs = [
            "dcn_v2.h",
            "vision.cpp",
        ],
        linkshared = True,
        linkstatic = False,
        linkopts = ["-Wl,-rpath,$$ORIGIN"],
        deps = [
            "//third_party/dcn_v2/src/cpu",
            "@pypi_torch//:libtorch",
        ],
        copts = ["-D_GLIBCXX_USE_CXX11_ABI=0"],
    )
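The doubled $$ in linkopts is just Bazel's escaping for Make-variable expansion, so the linker sees a plain $ORIGIN. If the extension can't find its neighbors at load time, you can confirm the rpath made it into the output (output path assumes the default bazel-bin layout):

    # Confirm $ORIGIN ended up in the shared object's dynamic section.
    readelf -d bazel-bin/third_party/dcn_v2/src/dcn_v2.so | grep -iE 'rpath|runpath'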
Again, some of this might not be necessary.
- And then you have to load it into Python, so in your py_library rule you go:
    # load added for completeness; requirement() comes from the generated
    # requirements.bzl of the pip_parse repo above.
    load("@pypi//:requirements.bzl", "requirement")

    py_library(
        name = "dcn_v2",
        srcs = glob(
            ["*.py"],
            exclude = ["**/*_test.py"],
        ),
        deps = [
            requirement("torch"),
        ],
        data = ["//third_party/dcn_v2/src:dcn_v2.so"],
    )
Obviously modulo your own paths and library names.
- And finally in your Python library:
    import torch

    # Path is workspace-relative, resolved against the working directory
    # (the runfiles root under bazel run/test).
    torch.ops.load_library("third_party/dcn_v2/src/dcn_v2.so")
    dcn_v2 = torch.ops.dcn_v2
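Loading by that relative path relies on the working directory being the runfiles root. A more robust sketch resolves the path through rules_python's runfiles library; this assumes you add that library as a dep and that your workspace is named my_workspace (both assumptions):

    # Sketch: resolve the extension via runfiles instead of relying on CWD.
    from rules_python.python.runfiles import runfiles

    import torch

    r = runfiles.Create()
    torch.ops.load_library(
        r.Rlocation("my_workspace/third_party/dcn_v2/src/dcn_v2.so"))
    dcn_v2 = torch.ops.dcn_v2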
Note that I haven't got this to work with the actual CUDA libs yet, but not because it's failing; I simply haven't gotten to it. The CPU extension builds, loads, and passes the tests.
Is there any idea of how to accommodate this in a bzlmod world as well? Without the flat repository namespace, just adding additional BUILD files becomes less viable for sharing dependencies across package managers: say, a multi-language repository sharing a common zlib dependency between pip-installed Python and C++.
@shahms can you elaborate more?
I'm not sure which aspect you want elaboration on, but consider a repository which contains both C++ and Python and uses protocol buffers from both. There is a pip protobuf package for the Python side, but this package needs to invoke protoc (written in C++) to generate the Python code, and this binary must generally be kept in sync with the C++ protobuf library in use.
There are a few different (conceptual) ways of stitching this together, but they all need coordination between rules_python, pip, and Bazel. When bzlmod is involved, you can't easily just drop a BUILD file into the pip repository using additive_build_content_file due to the repo mapping involved, since that pip repository won't have access to the "main" repositories. You might be able to get away with hard-coding the "canonical" repo names in the overlaid BUILD file, but that's incredibly fragile and fiddly. There's also the issue of ensuring that pip uses that dependency rather than trying to fetch it itself.
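To illustrate the fragility: hard-coding means writing Bazel's internal canonical repo names into the overlaid BUILD content, roughly like below. The ~-mangled name here is made up and can change across Bazel versions and module graphs:

    cc_library(
        name = "py_proto_shim",  # hypothetical target
        # Canonical names like this are an internal detail of bzlmod's
        # repo mapping, which is exactly why this approach is fiddly.
        deps = ["@@protobuf~21.7//:protobuf"],
    )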
Is it possible to use toolchain resolution to solve this problem? @fmeum @Wyverald
@meteorcloudy I think that relying on toolchain resolution for this would be overkill. The same patch that one would apply in a WORKSPACE setup should also work with Bzlmod.
For that, we may need to make a "reverse use_repo", let's call it make_visible_to, available in MODULE.bazel that can be used to inject a repo mapping entry into an extension (limited to the root module's MODULE.bazel file for non-isolated extension usages to prevent name clashes).
@tyler-french recently approached me with a similar situation: he needed to patch a Go repository managed by the go_deps extension to add a dependency on Tink. If the root module adds a dep on Tink via bazel_dep(name = "tink", ...), then make_visible_to(go_deps, com_github_google_tink = "tink") could make this repository visible to repositories generated by go_deps as com_github_google_tink.
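Spelled out, the proposal might look roughly like this in the root module's MODULE.bazel; make_visible_to is hypothetical and does not exist, and the Tink version is made up:

    # Hypothetical MODULE.bazel sketch for the proposed directive.
    bazel_dep(name = "tink", version = "1.7.0")  # version illustrative

    go_deps = use_extension("@gazelle//:extensions.bzl", "go_deps")
    go_deps.from_file(go_mod = "//:go.mod")

    # Proposed: expose the root module's "tink" repo to the repositories
    # generated by go_deps, under the name they expect.
    make_visible_to(go_deps, com_github_google_tink = "tink")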
> For that, we may need to make a "reverse use_repo", let's call it make_visible_to
Hmm, I'm not sure how difficult it is to implement this. Can we somehow just pass a resolved canonical label through module extension tags?
We can pass a canonical label to an arbitrary target visible from the root module into the extension via tags, but that has some serious usability problems:
- Users would need to make up a target to pass in the repo name, e.g. go_deps.additional_repo(name = "com_github_google_tink", label = Label("@tink//:does_not_need_to_exist")) (see the sketch after this list).
- From within the content of a patch or BUILD file, it isn't clear how to reference that label. Textual replacement of @com_github_google_tink with "@@" + tink_label.workspace_name is an option, but seems fragile.
- Every extension with the ability to patch or add BUILD file contents would need to build in this functionality, which could result in fragmentation.
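For completeness, the tag-based workaround from the first bullet would look roughly like this; additional_repo is a hypothetical tag class, not something go_deps actually provides:

    # Hypothetical tag: smuggle a resolved label into the extension so it
    # can recover the canonical repo name; the target never needs to exist.
    go_deps.additional_repo(
        name = "com_github_google_tink",
        label = "@tink//:does_not_need_to_exist",
    )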
@fmeum I ran into a similar situation with Go as well, specifically with https://github.com/google/brotli/tree/master/go where the root module depends on the C brotli library. In that case it was easy enough to patch the Go library BUILD file to use the canonical name from the root module, as that dependency wasn't using modules yet. Something like make_visible_to seems like a potential avenue here.
I think the fundamental issue is around bridging Bazel-managed module namespaces and pip, go, cargo, etc. While the rules in question will need some way of consuming those dependencies from Bazel, Bazel will need to provide a mechanism of making them available. But that all assumes (so far) that the root module already has a direct dependency on that shared dependency which it can make available.
It seems like this is https://github.com/bazelbuild/bazel/issues/19301