bazel icon indicating copy to clipboard operation
bazel copied to clipboard

Absolute path inclusion validation breaks C++26's `#embed` on Clang / macOS

Open burnpanck opened this issue 10 months ago • 15 comments

Description of the bug:

Clang ships the C++26 resource inclusion using #embed since version 19. This feature allows to include raw binary content directly into the compilation output. While the standard just says that the name given to the #embed directive is used in an "implementation defined" way to lookup the resource, in clang, it effectively is a path relative to the source file. Yet, bazel fails the build, claiming that a file was included by an absolute path:

ERROR: /.../embed-issue-project/BUILD.bazel:1:8: Compiling embed-issue.cpp failed: absolute path inclusion(s) found in rule '//:embed-issue':
the source file 'embed-issue.cpp' includes the following non-builtin files with absolute paths (if these are builtin files, make sure these paths are in your toolchain):
  '/.../embed-issue-project/embed-issue.dat'

Which category does this issue belong to?

C++ Rules

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Prepare the following project files:

MODULE.bazel

module(
    name = "embed-issue",
    version = "0.1.0",
)

bazel_dep(name = "rules_cc", version = "0.0.17")
bazel_dep(name = "toolchains_llvm", version="1.4.0")
llvm = use_extension("@toolchains_llvm//toolchain/extensions:llvm.bzl", "llvm")
llvm.toolchain(
    cxx_standard = {"": "c++2c"},
    llvm_version = "20.1.2",
)
use_repo(llvm, "llvm_toolchain")
register_toolchains("@llvm_toolchain//:all")

BUILD.bazel

cc_test(
    name = "embed-issue",
    srcs = [
        "embed-issue.cpp",
    ],
    data =[
        "embed-issue.dat",
    ],
)

embed-issue.cpp

#include <print>

static const unsigned char data[] = {
#embed "embed-issue.dat"
};

int main() {
  std::println("Embedded data: {}",data);
  return 0;
}

embed-issue.dat

hello-world

Then, run

bazel test :embed-issue

Which operating system are you running Bazel on?

macOS

What is the output of bazel info release?

release 8.2.0-homebrew

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?


If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

Searching neither StackOverflow nor previous issues on GitHub turned up anything that seems to relate to #embed.

Any other information, logs, or outputs that you want to share?

No response

burnpanck avatar May 25 '25 14:05 burnpanck

Note that #14346 reported inclusion validation issues due to absolute paths in clang's .d files. That issue was closed after inclusion validation checking was disabled for Obj-C, even though the issue extends beyond just Obj-C. I was experiencing that bug before despite it being marked closed. I believe the conditions were related to layered libraries with transitive header inclusions, and perhaps a combination of includes= and strip_include_prefix=. Perhaps this one here is just another case of the same thing: .d files may contain absolute paths that do not refer to system source, but project sources. Bazel should just do the right thing, and decide based on actual file location, instead of blindly discarding absolute paths.

burnpanck avatar May 25 '25 14:05 burnpanck

it would be useful to reproduce outside of bazel using clang and relative paths. IIUC right now the logic in bazel is just that it expects all paths (besides system include paths) in the output .d files to be relative because of -no-canonical-prefixes, if clang happens to ignore this setting for this new use case, it makes sense that it would lead to this failure

keith avatar Jun 05 '25 17:06 keith

@keith what exactly do you expect to see? I just checked the .d file generated by the bazel project above. Indeed, there is an absolute path to that to-be-embedded data file. Exactly as the error message says. If you insist, here is a reproducer just using clang (both apple clang 17 and homebrew clang 20; both on macOS):

> clang -MM embed-issue.cpp  
embed-issue.o: embed-issue.cpp \
  /Users/yves/temp/bazel-embed-issue/embed-issue.dat

(to make the code compile and run, for me clang --std=c++23 -lc++ embed-issue.cpp does the trick).

IMHO the bazel logic you mention is just plain broken. To me, absolute vs. relative paths is an implementation detail, two different ways to represent the same information that shouldn't impact any of the build outputs. In fact, because relative paths hide implicit information, I would argue that relative paths should be avoided as much as possible when referring to a concrete file.

Of course one can try to convince the clang people to output the embed file as a relative path into the .d file. But what is the argument here? "Our tool uses absolute vs. relative paths to distinguish between two different sets of files, so please use relative paths here"? As long as there isn't some kind of API specification on .d files which enshrines this exact meaning that bazel deduces from the difference between the two, basing an inclusion check that is not accessible to the rule-writer on an implementation detail is just a plain foot-gun (one, where the trigger is stuck)

burnpanck avatar Jun 06 '25 12:06 burnpanck

what exactly do you expect to see?

if you use #embed and also include a header, you should see that the header path is relative, and the embed path is absolute, this will require passing -no-canonical-prefixes like clang does. if they are both relative than we're misunderstanding the issue.

since in your example above you're not passing -no-canonical-prefixes, it's no surprise it's absolute. you can try the variations of -fcompilation-dir . or -ffile-prefix-map $PWD=. etc to see if any of those affect these paths as well.

I would argue that relative paths should be avoided as much as possible when referring to a concrete file.

probably not worth arguing too much about this since bazel has chosen this path and it is unlikely to change at that level. I imagine a potential fix could be that if the prefix of the path is the workspace root it's also allowed, im not sure what logic exists around that. but that would mean the .d files would be non-hermetic as well, which isn't immediately a problem since no other action consumes them, but still might cause issues, not sure.

But what is the argument here?

i think the general argument is that paths should be embedded in that file as they are passed, in order to support transient symlinks. For this specific example I guess I would use the argument that all other file paths are relativized with -no-canonical-prefixes, so this one being the odd one out seems like a bug.

keith avatar Jun 06 '25 19:06 keith

Exactly as expected; no matter if with or without -no-canonical-prefixes, the header using quotes (") comes with a relative path, but the #embed with an absolute one:

> /opt/homebrew/opt/llvm/bin/clang++ -MM -no-canonical-prefixes -ffile-prefix-map=$PWD=. embed-issue.cpp
embed-issue.o: embed-issue.cpp embed-issue.hpp \
  /Users/yves/temp/bazel-embed-issue/embed-issue.dat

-fcompilation-dir is not understood by my clang, and -ffile-prefix-map didn't make a difference either.

(now, embed-issue.cpp reads as follows, to avoid having to type system directories)

// #include <print>

#include "embed-issue.hpp"

static const unsigned char data[] = {
#embed "embed-issue.dat"
};


int main() {
//  std::println("Embedded data: {}",data);
  return 0;
}

For this specific example I guess I would use the argument that all other file paths are relativized with -no-canonical-prefixes, so this one being the odd one out seems like a bug.

I'm not sure that this will make you happy. At least in clang, #embed follows different rules than #include; the former is resolved with respect to --embed-dir, where the latter with respect to the usual includes search path. But within the .d file, you can't tell the difference between a file pulled in through #embed and one pulled in through #include. So with relativized paths, bazel won't be able to correctly deduce what file will be chosen by clang. In turn, bazel's view of the dependency chain is wrong, leading to wrong caching decisions when dependencies change.

probably not worth arguing too much about this since bazel has chosen this path and it is unlikely to change at that level.

The fundamental issue here is simply that bazel's data-model is just wrong for the task at hand. Dependencies are relations between specific files. The canonical way to identify a specific file on a file-system is an absolute path. In bazel's current data-model, those files are instead identified through the combination of an ordered list of inclusion paths and a relative path to be searched within that ordered list. But the .d files omit half of that information (the set of relevant inclusion paths). Thus, with the data-model chosen, .d files were never a correct way to communicate dependencies.

burnpanck avatar Jun 06 '25 21:06 burnpanck

FWIW I wouldn't be the right person to convince to change anything about the .d model, and practically speaking it seems very unlikely to change. But I am also interested in finding a solution here and I still think the most likely path forward is to make clang emit relative paths as it does for headers.

but 2 notes about your examples:

the include paths + relative conflicting header paths isn't an issue because the compiler emits the header path including the chosen include directory:

.
├── a
│   └── a.h
├── b
│   └── a.h
└── main.c
#include "a.h"

int main(){}
% clang -MM main.c -Ia -Ib
main.o: main.c a/a.h

so that is correct and takes include path ordering and conflicting header names into account.

The canonical way to identify a specific file on a file-system is an absolute path

I think bazel just has the luxury of not having to worry about this because of all the baked in assumptions and validations around what files can potentially end up in the symlink tree, given its hermiticity model. Also I believe bazel still pushes these files to the remote cache, so ofc the paths in it have to be relative or it will be incorrect for anyone else who didn't share the absolute path to your repo's checkout

keith avatar Jun 06 '25 22:06 keith

the include paths + relative conflicting header paths isn't an issue because the compiler emits the header path including the chosen include directory

Ah, I wasn't aware of that. So bazel may be able to talk it's way out of its design choice here.

and practically speaking it seems very unlikely to change.

Yeah, I guess that must be true. I'll see if I have any success with clang.

burnpanck avatar Jun 06 '25 22:06 burnpanck

Note that if you do provide --embed-dir with a relative path, and the resource is found via that path, then (at least my version of) clang will write a relative entry into the .d.

Maybe that's enough for you to work around the issue for now?

pcjanzen avatar Jun 07 '25 00:06 pcjanzen

Thanks for the hint, indeed, with --embed-dir in the cxxopts, I was able to get it to work. Another gotcha is that data= is the wrong attribute, because it only applies to runtime files; srcs= rejects my data files, but hdrs= is ok.

burnpanck avatar Jun 10 '25 11:06 burnpanck

hdrs will make it also available in downstream actions, perhaps additional_compiler_inputs (or use of implementation_deps) would be preferable.

pcjanzen avatar Jun 10 '25 14:06 pcjanzen

additional_compiler_inputs is intended for this. What did you set embed-dir to?

keith avatar Jun 10 '25 15:06 keith

In my actual use-case, the data files are in a sub-folder data/embed relative to the BUILD.bazel file in question, which happens to be the project top-level. I put cxxopts=["--embed-dir=data/embed/"] and then use #embed "data-file.dat".

burnpanck avatar Jun 10 '25 16:06 burnpanck

Thanks for the hint of additional_compiler_inputs; I looked for something like that, but couldn't find it.

burnpanck avatar Jun 10 '25 16:06 burnpanck

that actually doesn't seem to help in my test:

% clang++-20 -std=c++23 -MM main.cc -no-canonical-prefixes --embed-dir=. -ffile-compilation-dir=.
main.o: main.cc /tmp/repro/embed-issue.dat

does the repro case from above no longer repro if you use --embed-dir like this?

keith avatar Jun 10 '25 16:06 keith

Either use <> in the #embed statement or move the file into a different directory.

The problem is that clang has an implicit element at the start of the embed-dir search path that's used for double-quote lookups that's initialized to the absolute path of the corresponding source file.

pcjanzen avatar Jun 10 '25 16:06 pcjanzen