buck2 icon indicating copy to clipboard operation
buck2 copied to clipboard

genrule with output folder consumed by cxx_library

Open zjturner opened this issue 1 year ago • 17 comments

I want to have a genrule that outputs multiple cpp files to a directory, and then have those files consumed and all compiled by a cxx_library. I tried this:

genrule(
    name="foo-gen",
    srcs=["foo.json", "generate.py"],
    out = "foo-folder",   # This is the folder where all the output files go
    cmd = "python generate.py foo.json -o ${out}/foo"
)

cxx_library(
    name = "foo",
    srcs = glob(["$(output :foo-gen)/foo/*.cpp)"])

This doesn't work, I guess because you can't expand string parameter macros inside of a glob. The problem appears to be with the cxx rule, and not with the genrule, as the genrule can certainly create the directory and output multiple files there.

Is there anything that can be done to handle this use case (generate multiple files from one input, consume them from a dependent rule)?

zjturner avatar Dec 13 '23 18:12 zjturner

This seems to be basically the equivalent of "how do I do glob on an http_archive", which I somewhat addressed here: https://github.com/facebook/buck2/discussions/387#discussioncomment-7136698

I think you would need to write a custom rule that called cxx_library_impl and used dynamic dependencies to discover the list of files from the source directory. I'd been meaning to explore this kind of idea but didn't get too far before...

thoughtpolice avatar Dec 13 '23 20:12 thoughtpolice

If that’s the case, couldn’t cxx_library be taught to do it natively?

zjturner avatar Dec 13 '23 21:12 zjturner

Any attrs.source() can accept either a file, or a target that provides a DefaultInfo. DefaultInfo has a default_outputs field, and that's a plural; it's a list. So I wonder if you can just provide any old dependency with multiple default_outputs. Including the output of a dynamic dependency-using glob rule.

You can test that out before diving into dynamic deps using a rule that transforms some "srcs": attrs.list(attrs.source()) verbatim into DefaultInfo(default_outputs=ctx.attrs.srcs).

cormacrelf avatar Dec 14 '23 07:12 cormacrelf

Alas, apparently not. This gives "Expected a single artifact from root//scratch:id (platforms//:host#9f5b0576f0e10a13), but it returned 2 artifacts". I question what the point of default_outputs is if not for exactly this -- I reckon this is a great API for it.

# BUCK
load(":defs.bzl", "verbatim")

verbatim(
    name = "id",
    srcs = [ "one.cpp", "two.cpp" ],
)

cxx_library(
    name = "lib",
    srcs = [":id"]
)

# defs.bzl
verbatim = rule(
    impl = lambda ctx: [ DefaultInfo(default_outputs = ctx.attrs.srcs) ],
    attrs = { "srcs": attrs.list(attrs.source()) },
)

Edit: This API is so apt, I think it is actually meant to work:

https://github.com/facebook/buck2/blob/37c29004bcc5e1d58d2b9b3d93566c50d42ae224/app/buck2_analysis/src/attrs/resolve/configured_attr.rs#L82-L83

https://github.com/facebook/buck2/blob/37c29004bcc5e1d58d2b9b3d93566c50d42ae224/app/buck2_analysis/src/attrs/resolve/configured_attr.rs#L103-L105

cormacrelf avatar Dec 14 '23 08:12 cormacrelf

Ok, let's say I change my python script so that it takes multiple inputs on the command line, and generates multiple outputs. 1 input -> 1 output. Then how do I have cxx_library() consume it?

It takes a list of json files on the command line of the form <name>.json, and an output directory, and it outputs a bunch of files of the form ${OUTDIR}/<name>.json.cpp

Can someone post a genrule + cxx_library() invocation that can generate and consume this?

genrule(
    name="foo-gen",
    srcs=["generate.py"] + glob(["*.json"]),
    out = "foo-folder"
    cmd = "python ${srcs} -o ${out}/foo"
)

cxx_library(
    name = "foo",
    srcs = [":foo-gen"]
)

Doesn't work. Shouldn't this case be supported since we know the static list of generated files?

zjturner avatar Dec 14 '23 18:12 zjturner

I could also have multiple genrules, one for each generated input/output pair. But I can't get cxx_library() to consume that either.

It seems like cxx_genrule() is really intended for this purpose, but I don't know why it has to take all of the command line stuff directly instead of just being able to use the configured toolchain like everything else.

zjturner avatar Dec 14 '23 18:12 zjturner

I can't see a way to get genrule to do it. Then again I have almost never seen fit to use genrule -- it is just not very good at anything. I find myself typing out the boilerplate for a regular rule in a .bzl file, and ctx.actions.running the shell command there. In this case, genrule can't do it because there is no way to describe to genrule that the list of outputs is 1:1 with the srcs, and each is named .cxx instead of .json.

That being said, this is quite a lot of code. The crux of it is the first few lines of the _generated_cxx function.

Requires a buck2 with #522.

(Note that if you change this to have ctx.actions.run in a for loop, one per input, then you get parallelism for free from the buck2 executor, whereas this way python does one at a time.)

.bzl, .py, BUCK

defs.bzl:


def _generated_cxx(ctx: AnalysisContext) -> list[Provider]:
    outputs = [
        ctx.actions.declare_output(
            src.short_path.removesuffix(".json") + ".cxx"
        )
        for src in ctx.attrs.srcs
    ]
    parent_dir_arg = []
    if outputs:
        parent_dir_arg = cmd_args("--out-dir", outputs[0].as_output()).parent()

    ctx.actions.run(
        cmd_args(
            ctx.attrs.exe[RunInfo],
            parent_dir_arg,
            ctx.attrs.srcs,
        ).hidden(
            [ o.as_output() for o in outputs ]
        ),
        category = "generated_cxx"
    )
    return [ DefaultInfo(default_outputs = outputs) ]

generated_cxx = rule(
    impl = _generated_cxx,
    attrs = {
        "exe": attrs.exec_dep(providers = [RunInfo]),
        "srcs": attrs.list(attrs.source()),
    }
)

generate.py:

#!/usr/bin/env python3

import argparse
from pathlib import Path
from typing import NamedTuple, Optional


class Args(NamedTuple):
    out_dir: Optional[Path]
    inputs: list[Path]


def arg_parse():
    parser = argparse.ArgumentParser()
    parser.add_argument("--out-dir", type=Path)
    parser.add_argument("inputs", type=Path, nargs=argparse.REMAINDER)
    return Args(**vars(parser.parse_args()))


args = arg_parse()

for input in args.inputs:
    output = args.out_dir / input.with_suffix(".cxx").name
    with open(output, "w+") as f:
        f.write(f"int {input.stem}() {{ return 5; }}\n")

And the BUCK file

load(":defs.bzl", "generated_cxx")

# uses the python interpreter from toolchains//:python_bootstrap
python_bootstrap_binary(
    name = "generate.py",
    main = "generate.py",
)

# this has multiple outputs
generated_cxx(
    name = "foo-gen",
    exe = ":generate.py",
    srcs = glob(["*.json"]),
)

cxx_library(
    name = "lib",
    # :foo-gen is expanded to make the `srcs` list as long as the glob was
    srcs = [ ":foo-gen" ]
)

cormacrelf avatar Dec 19 '23 12:12 cormacrelf

It’s very strange that there is no rule combination out of the box that can do this. Doesn’t anything at Meta use (for example) protobufs? How would this work?

zjturner avatar Dec 19 '23 16:12 zjturner

Thanks, this is very helpful. I'm working with @zjturner on the same project. Based on your example I've gotten something working:

def _rbx_openapi_impl(ctx = "context"):
    base_folder = cmd_args(ctx.actions.declare_output("logs").as_output()).parent()
    out_folder = cmd_args(base_folder, format="{}/out")
    logs_folder = cmd_args(base_folder, format="{}/logs")

    generated_includes_folder = ctx.actions.declare_output("out/include")
    generated_srcs = [ctx.actions.declare_output("out/src", generated_src) for generated_src in ctx.attrs.generated_srcs]

    command = cmd_args(["python3", "Client/OpenApi/BuildScripts/generate.py", ctx.attrs.json, 
        "-l", logs_folder,
        "-o", out_folder,
        ] 
    )
        
    command.hidden([generated_src.as_output() for generated_src in generated_srcs])
    command.hidden(generated_includes_folder.as_output())

    ctx.actions.run(command, category="openapi")
    
    return [DefaultInfo(
        default_outputs = generated_srcs,
        sub_targets = {
            "generated_includes_folder" : 
                [DefaultInfo(
                    default_outputs = [generated_includes_folder]
                )]
        }
    )]

rbx_openapi = rule(
    impl = _rbx_openapi_impl,
    attrs = {
        "json" : attrs.source(),
        "generated_srcs": attrs.list(attrs.string(), default=[]),
    },
)

rbx_openapi(
    name = "assets-upload-api",
    json = "content/specs/assets-upload-api.json",
    generated_srcs = [
        "Asset.cpp",
        ],
)

cxx_library(
    name="openapi",
    srcs=[":assets-upload-api"],
    public_include_directories=[
        "include",
        ],
   exported_preprocessor_flags = [ 
    "-I$(location :assets-upload-api[generated_includes_folder])",
    ]
)

Problems:

Getting the output folder

It's really awkward to get the base folder to pass into the script. I have to have an output in the base folder that I can then call parent() on. In my case, the output files are sometimes in nested folders, so I can't rely on calling parent on one of them. The script happens to create a logs folder, so I can use that, but it's a weird way of getting there.

The most intuitive way would be to declare the root output folder as an output itself, but if I do that, then I get an error message about the root output folder conflicting with the actual outputs - even if I declare the output folder as being a directory. I'm not sure what use it is to declare an output as a directory if you can't then declare any outputs that are inside that directory. Is there some other way I'm missing?

Getting the includes folder

Since the script generates includes that need to be exported to targets that link the library, I need to get the include folder and pass it along. I can return the folder as a nested target, but if I pass :assets-upload-api[generated_includes_folder] to public_include_directories it doesn't get converted to the folder name (instead it just adds the literal :assets-upload-api string to the include paths).

Also, eventually (I think) I'm going to need to export the individual include files. I can do that similarly to how I return the source files, but I can't return both the generated include folder and the generated includes inside of it as outputs, since they conflict as I mentioned above.

Dynamic dependencies

In the case I'm working on, a single JSON file generates a number of source files, so I'm passing the expected source files in as generated_Srcs. From what I've read, it seems that I could use dynamic dependencies to get the list of source files instead. I would just need to have a way to generate a file with the source file names. Is that correct, or is there something about dynamic dependencies that would make that not work?

thanks, chris

AtomicOperation avatar Dec 19 '23 17:12 AtomicOperation

Actually, looking at the dynamic dependencies docs more closely, it seems that all outputs have to be declared outside the nested lambda function, so I don't think it's possible to use dynamic dependencies to generate the expected output files.

AtomicOperation avatar Dec 19 '23 17:12 AtomicOperation

Yes, technically you can write a file that lists other files, and then is passed as an arg file to the cxx compiler via compiler_flags = ["@$(location :gen)"] can make buck compile all those files. But there is a big caveat -- I don't believe those cpp files will be uploaded to RE when compiling the cxx_library, because buck doesn't know about them. It will just upload the arg file and then fail for missing cpp files.

(If you want to hack it together locally, you can't make such a file using genrule because genrule has been forbidden from seeing GEN_DIR (in order to write paths that any other rule could use) -- grep prelude for GEN_DIR_DEPRECATED. Probably for the reason above. A normal Python program can do it. Your cxx_library needs a single empty dummy file as srcs and then it will compile all the ones in the arg file.)

If what I said is accurate, the best approach is to "buckify" the OpenAPI spec to extract a list of filenames it will generate. You write a little Python program that reads json and, surprisingly, writes more json but to a .bzl file, turning on indentation, and wrapped like so:

# to regenerate, run XXX
manifest = {
    "files": [
        "A.cpp",
        "B.cpp"
    ]
}

Then:

load(":manifest.bzl", "manifest")
generate_whatever(
    name = "gen",
    json = ...,
    outputs = manifest.files,
)
cxx_library(
    srcs = [":gen"],
)

Also could probably hash the input json and store the hash in the manifest, to remind you to rebuckify later.

cormacrelf avatar Dec 19 '23 21:12 cormacrelf

I don't believe those cpp files will be uploaded to RE when compiling the cxx_library, because buck doesn't know about them

I think Buck2 will know about them. The $location grabs them. This should be RE safe.

I'm not really sure about cxx_genrule at all. I'm not massively sure what it is used for.

I'm not sure it's possible to have a genrule generate non-statically-known C++ files and then use cxx_library to compile them. There are two options:

You do know the filenames statically

If so, even if genrule creates them, you can lift them up to separate named outputs. You can do that with something like:

genrule(
   name = "foo",
   cmd = ...,
   outs = {"a":  "a.cpp", "b": "b.cpp"},
)

cxx_library(
   name = "bar",
   srcs = ["$(location :foo[a])", "$(location :foo[b])"]
)

If that works then maybe default_outs would let you simplify the cxx_library.

You don't know the filenames statically

This is much harder. You certainly could write a cxx_library that worked with a directory, but once you have an unknown directory you can't have the filenames ever pop up in providers again. That means you can't have a list of object files in a provide, but would have to have the library bundle them into an archive, or leave them in a directory that you then passed onwards.

As to how we do it at Meta - we always have the list of filenames statically available. Note that even if a tool generates multiple C++ files, you can usually just cat them together to make a single C++ file and then not worry about it. We've definitely done that too.

ndmitchell avatar Dec 19 '23 22:12 ndmitchell

Thanks! The dynamic source part isn't that critical, I was more curious to make sure I understood what was possible. Specifying the files statically makes sense.

Do you have any thoughts about better ways to get the output folder when running the rule or how to return the include folder so it can be used by targets that depend on it (from my comment above)? The feature/fix requests from me would be:

  • Allow specifying a directory as an output while also having output files underneath it (including being able to do something like declare_output("/") to get the root output path)
  • Update public_include_directories and similar attributes to support $(location :target[subtarget]) or something similar so a generated include path can be used by dependencies

I think that would fix the problems I've had to work around, but I'm not sure whether these ideas fit into the design of buck2.

AtomicOperation avatar Dec 20 '23 14:12 AtomicOperation

One solution that i think might even be the best solution is to create a new rule like generated_cxx. I think if you have it return a CPreprocessorInfo, then include path inheritance will automatically work.

zjturner avatar Dec 20 '23 16:12 zjturner

@AtomicOperation I think public_include_directories is a bit of a hack. I think if you use header_dirs instead it will let you specify a $location. Maybe that's what @zjturner is after with generated_cxx ?

Not sure what you mean by "or something similar so a generated include path can be used by dependencies" but maybe that isn't required with header_dirs?

ndmitchell avatar Dec 20 '23 17:12 ndmitchell

I suspect he means that if the genrule is :foo and it generates sources and headers, and:bar is a cxx_library that compiles the generated sources, and :baz is a cxx_library that depends on :bar, then :baz may have a transitive dependency on header files in the output directory of :foo.

Haven’t tested, but i think if a custom rule returns a CPreprocessorInfo, this all “just works”, because cxx_library should consume it automatically

zjturner avatar Dec 20 '23 17:12 zjturner

Custom rule returning CPreprocessorInfo or header_dirs should both work for that.

ndmitchell avatar Dec 20 '23 22:12 ndmitchell