
TensorFlow is broken with Bazel@HEAD

Open meteorcloudy opened this issue 1 year ago • 12 comments

https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/2640#01835384-a5a8-4cef-839a-92041e93f6fb

(03:29:23) ERROR: /var/lib/buildkite-agent/builds/bk-docker-062w/bazel-downstream-projects/tensorflow/tensorflow/BUILD:1409:19: Executing genrule //tensorflow:tf_python_api_gen_v2 failed: (Aborted): bash failed: error executing command (from target //tensorflow:tf_python_api_gen_v2)
  (cd /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/18887e87f83f3c86081828342b40a00a/execroot/org_tensorflow && \
  exec env - \
    PATH=/var/lib/buildkite-agent/.cache/bazelisk/local/-tmp-tmppfqvvf95-bazel/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
    PYTHON_BIN_PATH=/usr/bin/python3 \
    PYTHON_LIB_PATH=/usr/lib/python3/dist-packages \
    TF2_BEHAVIOR=1 \
  /bin/bash -c 'source external/bazel_tools/tools/genrule/genrule-setup.sh; bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2 --root_init_template=tensorflow/api_template.__init__.py --apidir=bazel-out/k8-opt/bin/tensorflow_api/v2/ --apiname=tensorflow --apiversion=2  --compat_apiversion=1 --compat_apiversion=2  -- 
...
...

Culprit finder: https://buildkite.com/bazel/culprit-finder/builds/3607#01834b10-64ec-4e64-812f-2357b4ccd39a Bisect shows dadc49e437018f482640ed76fae5307daf9911a8 is the culprit

meteorcloudy avatar Sep 19 '22 07:09 meteorcloudy

/cc @c-parsons @oquenchil Can you look into this? This is blocking 6.0 cut since it's breaking TensorFlow.

My guess is https://github.com/bazelbuild/bazel/commit/dadc49e437018f482640ed76fae5307daf9911a8 somehow caused the command line to be too long?

If dadc49e437018f482640ed76fae5307daf9911a8 is valid, maybe the fix should be from the TensorFlow side?

meteorcloudy avatar Sep 19 '22 07:09 meteorcloudy

It's unclear to me how dadc49e could be causing this failure, both from looking at the failure itself and at the surrounding builds.

  1. The commit in question changes cc_shared_library validation, and shouldn't practically change actions or command lines. The broken target is a genrule, and the underlying error description (I think) is:
terminate called after throwing an instance of 'google::protobuf::FatalException'
  what():  CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):
/bin/bash: line 1: 87712 Aborted

I'm unsure how this could be related...

  2. It looks like TensorFlow succeeded with Bazel@HEAD on Aug 3, which was well after dadc49e (Jun 10)? https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/2573

I'm rerunning the culprit finder, just in case this was a flake. If I get a similar result pointing to that commit, I'll need to dig much deeper.

c-parsons avatar Sep 19 '22 15:09 c-parsons

It looks like TensorFlow succeeded with Bazel@HEAD on Aug 3, which was well after https://github.com/bazelbuild/bazel/commit/dadc49e437018f482640ed76fae5307daf9911a8 (Jun 10)?

Yes, that's why we didn't discover this earlier. TensorFlow started to use cc_shared_library recently. I believe it's newer commits in TensorFlow that are broken with this change.

meteorcloudy avatar Sep 19 '22 17:09 meteorcloudy

I can probably help you bisect TensorFlow to find out which change is incompatible with https://github.com/bazelbuild/bazel/commit/dadc49e437018f482640ed76fae5307daf9911a8

meteorcloudy avatar Sep 19 '22 17:09 meteorcloudy

A known good TF commit is a3182a1eb28b5b6d081391669db3546325b5f1f1, and a known bad TF commit is 322c0a555914006635603c51f39d073ce83cd368

meteorcloudy avatar Sep 19 '22 17:09 meteorcloudy

Bisecting on my gLinux machine with:

$ docker run -d -it gcr.io/bazel-public/ubuntu2004-java11
3d0814114fd76a1013f8f78028da45fd9b07ad56d7d55be2df44e365f0e50388
pcloudy@pcloudy:~
$ docker exec -it 3d0814114fd76a1013f8f78028da45fd9b07ad56d7d55be2df44e365f0e50388  /bin/bash
root@3d0814114fd7:/# mkdir workdir
root@3d0814114fd7:/# cd workdir/
root@3d0814114fd7:/workdir# git clone https://github.com/tensorflow/tensorflow.git
...
root@3d0814114fd7:/workdir# cd tensorflow/
root@3d0814114fd7:/workdir/tensorflow# export USE_BAZEL_VERSION=dadc49e437018f482640ed76fae5307daf9911a8
root@3d0814114fd7:/workdir/tensorflow# pip3 install portpicker keras_applications keras_preprocessing future packaging dataclasses
...
root@3d0814114fd7:/workdir/tensorflow# ./configure
...
root@3d0814114fd7:/workdir/tensorflow# git bisect start d7f0308d5c2eb647d7348ee1eacd2042c0504b27 ae909b0995845407a8fc77e16726a6314386544a
Bisecting: 1193 revisions left to test after this (roughly 10 steps)
[a3182a1eb28b5b6d081391669db3546325b5f1f1] [xla:mlir] NFC: Use std::string_view instead of llvm::StringRef
root@3d0814114fd7:/workdir/tensorflow# git bisect run bazel build //tensorflow/tools/pip_package:build_pip_package

meteorcloudy avatar Sep 19 '22 17:09 meteorcloudy

Indeed, the second culprit finder still points to that commit.

Would be interested in the results of your bisect. Sorry, it's been a bit of a day with unrelated issues. I can look into this more actively tomorrow.

c-parsons avatar Sep 19 '22 20:09 c-parsons

The bisect is taking longer than I expected; it was interrupted by unrelated build failures.
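One way to keep `git bisect run` from being derailed by unrelated breakage is to classify each step's result and return 125 ("skip this commit") for failures that don't match the bug under investigation. This is a hypothetical helper, not the script used in this bisect; the grep pattern is an assumption based on the genrule error quoted above:

```shell
# classify_exit STATUS LOGFILE
# Map a bazel invocation's exit status and log to git-bisect-run semantics:
# 0 = good commit, 1 = the failure being bisected, 125 = skip (unrelated breakage).
classify_exit() {
  status="$1"
  log="$2"
  if [ "$status" -eq 0 ]; then
    echo 0    # build passed: good commit
  elif grep -q "tf_python_api_gen_v2" "$log"; then
    echo 1    # the genrule failure under bisection: bad commit
  else
    echo 125  # some other failure: tell git bisect to skip this commit
  fi
}
```

A wrapper passed to `git bisect run` would run the build, capture its log and exit status, and then `exit "$(classify_exit "$status" "$log")"`.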

meteorcloudy avatar Sep 20 '22 11:09 meteorcloudy

I can confirm I can reproduce this issue at https://github.com/tensorflow/tensorflow/commit/007f31c872c9c74f1298b7389d324e3947ede5ce with Bazel at dadc49e437018f482640ed76fae5307daf9911a8, but not at the parent commit of https://github.com/tensorflow/tensorflow/commit/007f31c872c9c74f1298b7389d324e3947ede5ce

meteorcloudy avatar Sep 20 '22 16:09 meteorcloudy

I'm having issues getting even a clean build passing with tensorflow, either at HEAD or at https://github.com/tensorflow/tensorflow/commit/007f31c872c9c74f1298b7389d324e3947ede5ce (with bazel 5.3.1) :\

In both cases:

bazel build -c opt --output_filter=^$ --test_env=HOME --test_env=BAZELISK_USER_AGENT --test_env=USE_BAZEL_VERSION -- //tensorflow/tools/pip_package:build_pip_package

Error at https://github.com/tensorflow/tensorflow/commit/007f31c872c9c74f1298b7389d324e3947ede5ce:

ERROR: /usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/external/boringssl/BUILD:130:11: Compiling src/third_party/fiat/curve25519.c failed: (Exit 1): gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 44 arguments skipped)
external/boringssl/src/third_party/fiat/curve25519.c:511:57: error: argument 2 of type 'const uint8_t[32]' {aka 'const unsigned char[32]'} with mismatched bound [-Werror=array-parameter=]
  511 | int x25519_ge_frombytes_vartime(ge_p3 *h, const uint8_t s[32]) {
      |                                           ~~~~~~~~~~~~~~^~~~~
In file included from external/boringssl/src/third_party/fiat/curve25519.c:41:
external/boringssl/src/third_party/fiat/internal.h:117:58: note: previously declared as 'const uint8_t *' {aka 'const unsigned char *'}
  117 | int x25519_ge_frombytes_vartime(ge_p3 *h, const uint8_t *s);
      |                                           ~~~~~~~~~~~~~~~^
external/boringssl/src/third_party/fiat/curve25519.c:831:57: error: argument 2 of type 'const uint8_t *' {aka 'const unsigned char *'} declared as a pointer [-Werror=array-parameter=]
  831 | void x25519_ge_scalarmult_base(ge_p3 *h, const uint8_t *a) {
      |                                          ~~~~~~~~~~~~~~~^
external/boringssl/src/third_party/fiat/internal.h:125:56: note: previously declared as an array 'const uint8_t[32]' {aka 'const unsigned char[32]'}
  125 | void x25519_ge_scalarmult_base(ge_p3 *h, const uint8_t a[32]);
      |                                          ~~~~~~~~~~~~~~^~~~~
cc1: all warnings being treated as errors
Target //tensorflow/tools/pip_package:build_pip_package failed to build

Error at HEAD:

ERROR: /usr/local/google/home/cparsons/tensorflow-github/tensorflow/tensorflow/BUILD:1409:19: Executing genrule //tensorflow:tf_python_api_gen_v2 failed: (Exit 1): bash failed: error executing command /bin/bash -c ... (remaining 1 argument skipped)
2022-09-20 13:21:57.687575: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/tools/api/generator/create_python_api.py", line 22, in <module>
    from tensorflow.python.tools.api.generator import doc_srcs
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/__init__.py", line 42, in <module>
    from tensorflow.python import data
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/data/__init__.py", line 21, in <module>
    from tensorflow.python.data import experimental
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/data/experimental/__init__.py", line 96, in <module>
    from tensorflow.python.data.experimental import service
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/data/experimental/service/__init__.py", line 419, in <module>
    from tensorflow.python.data.experimental.ops.data_service_ops import distribute
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/data/experimental/ops/data_service_ops.py", line 22, in <module>
    from tensorflow.python.data.experimental.ops import compression_ops
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/data/experimental/ops/compression_ops.py", line 16, in <module>
    from tensorflow.python.data.util import structure
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/data/util/structure.py", line 29, in <module>
    from tensorflow.python.ops import resource_variable_ops
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/ops/resource_variable_ops.py", line 36, in <module>
    from tensorflow.python.framework import meta_graph
  File "/usr/local/google/home/cparsons/.cache/bazel/_bazel_cparsons/67a079ece8e55c9ef2890c54a8d15515/execroot/org_tensorflow/bazel-out/k8-opt/bin/tensorflow/create_tensorflow.python_api_tf_python_api_gen_v2.runfiles/org_tensorflow/tensorflow/python/framework/meta_graph.py", line 18, in <module>
    from packaging import version as packaging_version  # pylint: disable=g-bad-import-order
ModuleNotFoundError: No module named 'packaging'
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
ERROR: /usr/local/google/home/cparsons/tensorflow-github/tensorflow/tensorflow/lite/python/BUILD:68:10 Middleman _middlemen/tensorflow_Slite_Spython_Stflite_Uconvert-runfiles failed: (Exit 1): bash failed: error executing command /bin/bash -c ... (remaining 1 argument skipped)

Tensorflow seems green at HEAD, so I'm not sure what's up.

c-parsons avatar Sep 20 '22 17:09 c-parsons

cc1: all warnings being treated as errors

Ah, I was also hitting this strange error. A temporary hack is to

vim bazel-tensorflow/external/boringssl/BUILD

Then comment out the -Wall, -Werror flags in the BUILD file.
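For repeated rebuilds, the same edit can be scripted. A hypothetical helper (the sed patterns assume the flags appear as quoted, comma-terminated list elements in the generated BUILD file):

```shell
# disable_werror FILE
# Drop the "-Wall" and "-Werror" copts from a generated BUILD file so the
# array-parameter warnings stop being fatal. Temporary hack only.
disable_werror() {
  sed -i 's/"-Werror",//g; s/"-Wall",//g' "$1"
}

# Example (path is the convenience symlink mentioned above):
# disable_werror bazel-tensorflow/external/boringssl/BUILD
```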

ModuleNotFoundError: No module named 'packaging'

I guess you'll have to run pip3 install portpicker keras_applications keras_preprocessing future packaging dataclasses

meteorcloudy avatar Sep 20 '22 17:09 meteorcloudy

Thanks a bunch! Have reproed the error successfully.

Still don't have any leads on the underlying failure -- it may take me a bit.

c-parsons avatar Sep 20 '22 21:09 c-parsons

This has been exceedingly hard to debug, both because Tensorflow's setup is very complicated and because the debug tools for tracking cc_shared_library dependencies could use some work.

I've made a little progress, and have a fix, but I'm not sure if it's the most principled solution.

Effectively, it seems that, with dadc49e, we collect static dependencies of cc_shared_library across any deps-like edges.

This results in a change to //tensorflow/python:_pywrap_tensorflow_internal.so (there are symbol differences before/after this change). But, there's no change to several other shared libraries, such as //tensorflow:libtensorflow_framework.so

By doing some print debugging and diffing, I tried to track down which link-static dependencies were new to this shared library as a result of the Bazel change. Several filegroup dependencies were newly picked up via deps and srcs. Many of these added dependencies have no effect, but there must be at least one that does.

Amending the check in the offending change to:

if fieldname == "_cc_toolchain" or fieldname == "target_compatible_with" or fieldname == "srcs" or fieldname == "hdrs":
    continue

seems to make tensorflow's build pass successfully.

Still, I'm not sure this is really a principled change; I feel I haven't found the true root cause of tensorflow's failure. The proposed fix might unblock tensorflow's build, but it feels hacky.

c-parsons avatar Sep 23 '22 21:09 c-parsons

@c-parsons Thanks!

/cc @oquenchil Do you have any insight into this issue? How should the aspect work for cc_shared_library?

/cc @learning-to-play, who is working on TF's cc_shared_library migration.

meteorcloudy avatar Sep 26 '22 08:09 meteorcloudy

Hi Chris,

I agree with you that if we add more attribute names to the check, it would get a bit out of hand. Even hard-coding "_cc_toolchain" does not seem ideal when looking at it again; I can imagine someone's custom C++ rule in Starlark placing the toolchain in an attribute called "_toolchain".

From the change description and attached issues I can't really tell what prompted this change in the first place. Maybe if we go back to the original problem, we can find a different way to do this that doesn't break TF.

oquenchil avatar Sep 26 '22 12:09 oquenchil

Looking at this a bit more I can imagine why you sent the change in the first place. First just to document what the problem is, let me rephrase the problem that Chris found during debugging.

The cc_shared_library implementation decides what to link statically based on the structure of the build graph; it uses the aspect to figure out that structure. It tries to link statically every library in the transitive closure of the direct deps of the shared library (i.e. the roots) that is reachable without going through dynamic libraries. That's why we need the structure of the graph.

Chris' original change is correct and wrong at the same time. It makes libraries that should be linked statically but previously weren't get linked statically (because his project has custom C++ rules with different attribute names, which is a valid use case), but it also links libraries statically that shouldn't be.

The proposed fix of hard-coding the attribute names "srcs" and "hdrs" cuts the chain that builds the structure of the graph that propagates via genrules, filegroups, etc. I think that chain can be cut earlier by using required_providers on the aspect definition. This will skip the genrule and filegroup altogether; it also forces the custom C++ rules to advertise the CcInfo provider, so Chris, you might want to check that this works for your project too, not just for Tensorflow.

If it works, I think the whole line if fieldname == "_cc_toolchain" or fieldname == "target_compatible_with" or fieldname == "srcs" or fieldname == "hdrs": can be removed, including the field names that were already there: _cc_toolchain and target_compatible_with.

Note: There is still one edge case where a precompiled library coming from a filegroup or generated by a genrule isn't linked statically when it should be, but if any project bumps into that in the future, they should wrap that target in a cc_import, which would be the principled thing to do. This shouldn't happen with TF now, since with the old code they were definitely not relying on any filegroup or genrule being linked that way anyway.
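A minimal sketch of the required_providers idea (illustrative only; the names here do not match the real experimental_cc_shared_library.bzl implementation):

```starlark
# Sketch: a graph-structure aspect that stops at non-CcInfo targets.
GraphNodeInfo = provider(fields = ["label", "children"])

def _graph_structure_aspect_impl(target, ctx):
    children = []
    for dep in getattr(ctx.rule.attr, "deps", []):
        if GraphNodeInfo in dep:
            children.append(dep[GraphNodeInfo])
    return [GraphNodeInfo(label = target.label, children = children)]

graph_structure_aspect = aspect(
    implementation = _graph_structure_aspect_impl,
    attr_aspects = ["*"],
    # Only propagate to targets that advertise CcInfo, so the traversal
    # never descends through genrules or filegroups.
    required_providers = [CcInfo],
)
```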

oquenchil avatar Sep 26 '22 14:09 oquenchil

Thanks for the summary! You are indeed correct on my project's motivations.

Using required_providers with CcInfo sounds like a good idea, and a more principled solution. Let me give this a try on our project and I'll get back to you.

c-parsons avatar Sep 26 '22 15:09 c-parsons

I've run into some issues using the required_providers approach. My understanding is that this will only propagate the aspect to targets which advertise their provider using provides =. This is only a subset of targets which propagate this provider.

For example, the native implementation of cc_import propagates CcInfo, but does not advertise using provides = [CcInfo]. As a result, the aspect will not propagate to cc_import.

We can get around this by checking for CcInfo in dep manually in the aspect. This is a bit of a hack, but I've confirmed it keeps my project working.

However, this seems to not fix the Tensorflow issue.

Briefly, the proposed solution is to change https://github.com/bazelbuild/bazel/blob/f6a99afbd61c965b6e6888239ae432c4bb9a564a/src/main/starlark/builtins_bzl/common/cc/experimental_cc_shared_library.bzl#L592-L603 to:

    for fieldname in dir(ctx.rule.attr):
        deps = getattr(ctx.rule.attr, fieldname, None)
        if type(deps) == "list":
            for dep in deps:
                if type(dep) == "Target" and GraphNodeInfo in dep and CcInfo in dep:
                    children.append(dep[GraphNodeInfo])
        # Single dep.
        elif type(deps) == "Target" and GraphNodeInfo in deps and CcInfo in deps:
            children.append(deps[GraphNodeInfo])

c-parsons avatar Sep 27 '22 16:09 c-parsons

I guess the reason is that the CcInfo approach didn't prevent some dependencies from being propagated via the "srcs" and "hdrs" attributes, because some targets in those attributes do have the CcInfo provider?

meteorcloudy avatar Sep 28 '22 08:09 meteorcloudy

I think one unsolved mystery is why linking more dependencies would cause a crash.

meteorcloudy avatar Sep 28 '22 08:09 meteorcloudy

Indeed, can we get any help from TF here to demystify their build?

I'm inclined to go with the hacky solution I outlined in https://github.com/bazelbuild/bazel/issues/16296#issuecomment-1256694591 for the time being. We can follow up with a more principled solution after we unblock the release. WDYT?

c-parsons avatar Sep 28 '22 14:09 c-parsons

It feels like we are missing a way to express this in Starlark. Chris, what are your custom attribute names? If we are going to hardcode hdrs and srcs, I'd prefer that we go back to the way it was before your change and include your attribute names in a list alongside "deps", unless they are very specific to your project rather than generic-sounding.

In any case, to unblock this it will definitely require one of those two hacky solutions while we come up with a better way.

oquenchil avatar Sep 28 '22 14:09 oquenchil

We use a number of custom Starlark rules in our graph (generally hidden away by macros). Attribute names include:

"root", "roots", "shared", "includes_deps", "system_libraries", "stub_target", "library_target", "root_target", "source_target"

I'm frankly not sure without further investigation whether we'd need to allowlist all of these attribute names -- it's possible we only need "root", "roots", and "shared".

If it's worth investigating the viability of this path (that a hacky allowlist is better than a hacky denylist), I'm happy to give this a whirl in my project. Let me know.

c-parsons avatar Sep 28 '22 15:09 c-parsons

So I was thinking about this some more yesterday, and I think the right way, and the way it should have been done from the very beginning, is with the concept of aspect hints introduced by @scentini and @hlopko.

I said "concept" because we don't need to use the built-in support for aspect_hints. There is no need for individual targets of your custom rule to set the aspect hints; it would just be the rule definition that does this. So cc_shared_library can have a new provider PropagationHints that is public just like CcSharedLibraryInfo. In there you would put "root", "roots" and "shared", which we would read from the aspect implementation to decide whether we should merge the results.

If the PropagationHints provider is not present, then propagation would happen just like it did before your original change, via the "deps" attribute. I should probably come back to this later and make sure that in the default case it only propagates via "deps" when the dep has CcInfo, but that's a behavior change that may break things and is unrelated to your change, so I'll take care of it.

oquenchil avatar Sep 29 '22 09:09 oquenchil

Apologies, I don't quite follow the suggestion. How would we pass PropagationHints with "root", "roots", and "shared" to a cc_shared_library target? Presumably you're suggesting cc_shared_library would define PropagationHints which can be constructed with a string list of attribute names, and then use the value of a passed PropagationHints to know which attributes to propagate the aspect? But I'm missing the mechanism to pass PropagationHints.

c-parsons avatar Sep 29 '22 21:09 c-parsons

cc_shared_library built-in code defines and exposes the provider PropagationHints. Your custom rule can instantiate this provider which the graph structure aspect will be able to see.

In the PropagationHints instance that you create, you place in a list field the attributes of your custom rule through which you want propagation to happen. I think this separates responsibilities cleanly. For example, if you have a new attribute you want to add to the list, you wouldn't have to wait for a change to Bazel, you can just change what you put inside the PropagationHints provider.


    PropagationHints = provider(fields = {
        "attributes": "Attributes through which the cc_shared_library aspect should propagate",
    })
    ....
    propagation_attributes = []
    if PropagationHints in target:
        propagation_attributes.extend(target[PropagationHints].attributes)
    elif hasattr(ctx.rule.attr, "deps") and type(ctx.rule.attr.deps) != "Target":  # Original code before dadc49e437018f482640ed76fae5307daf9911a8
        propagation_attributes.append("deps")

    for attribute in propagation_attributes:
        for dep in getattr(ctx.rule.attr, attribute):
            if GraphNodeInfo in dep:
                children.append(dep[GraphNodeInfo])
    .....

I didn't test the code but that's the idea.
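On the producer side, a custom rule could attach the hints roughly like this (an illustrative sketch; `my_cc_wrapper` and its attribute are hypothetical, and PropagationHints is assumed to be the provider cc_shared_library would expose):

```starlark
# Hypothetical custom rule that wraps a cc target under a non-"deps"
# attribute and tells the cc_shared_library aspect how to traverse it.
def _my_cc_wrapper_impl(ctx):
    return [
        # Forward CcInfo so the wrapper behaves like a C++ dependency.
        ctx.attr.root_target[CcInfo],
        # Tell the graph-structure aspect which attributes to follow.
        PropagationHints(attributes = ["root_target"]),
    ]

my_cc_wrapper = rule(
    implementation = _my_cc_wrapper_impl,
    attrs = {"root_target": attr.label(providers = [CcInfo])},
)
```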

oquenchil avatar Sep 30 '22 10:09 oquenchil

@c-parsons Does Pedro's proposal work for your use case?

meteorcloudy avatar Oct 05 '22 13:10 meteorcloudy

It probably does, but I haven't had the cycles to verify. I hope to get you an answer tomorrow. Sorry about this!

c-parsons avatar Oct 05 '22 20:10 c-parsons

@c-parsons No worries, we pushed the release cut another 2 weeks back (2022-10-24), so we'll have some extra time for this. That said, this is still one of the few hard release blockers for 6.0.

meteorcloudy avatar Oct 06 '22 08:10 meteorcloudy

OK, I have confirmed the approach described above fixes my project. Furthermore, after about another day of diagnosing and debugging, I've determined we would only need to add PropagationHints to a single additional rule, to support the attribute "root_target".

In light of this finding, we technically have a workaround: we can rename root_target of that rule to deps. Thankfully, this is a fairly easy refactoring as this rule is private to a macro.
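The rename could be confined to the macro, keeping the public API stable. A hypothetical sketch (rule and macro names are illustrative):

```starlark
# Hypothetical: the macro keeps its public root_target parameter, while
# the private rule now takes the dependency via "deps", which the
# cc_shared_library aspect already propagates through.
def my_library(name, root_target, **kwargs):
    _my_private_rule(
        name = name,
        deps = [root_target],  # renamed from root_target
        **kwargs
    )
```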

So this raises the question: do we want to fix forward with your PropagationHints suggestion, or roll back the original offending commit dadc49e?

Issues with the roll-forward suggestion: One snag in taking the PropagationHints approach is that there isn't currently an easy and clean way to expose a namespace-scoped PropagationHints provider to users. Correct me if I'm wrong, but I think users can't import "PropagationHints" as a symbol from experimental_cc_shared_library.bzl; they would need to refer to it as a global native symbol. At which point: do we name this CcSharedLibraryPropagationHints? Do we make it part of the cc_common Starlark module?

This requires more thought, IMO.

Issues with the rollback suggestion: I'm admittedly not 100% confident in my analysis of how limited the root problem is in my project. Our push for the upstream commit dadc49e actually predates the private root_target rule, which is evidence that our rule / macro structure has changed significantly in the last 3 months. Thus, I have concerns that we'll need a principled, customizable solution to this problem, if not now then very soon.

My suggestion: roll back dadc49e to unblock 6.0, but raise this issue as something we need to solve at very high priority (though not necessarily as a 6.0 blocker!). It helps that my project follows Bazel@HEAD very closely, so it's probably reasonable to solve this after the 6.0 cut.

What do you think?

c-parsons avatar Oct 07 '22 01:10 c-parsons