pants Code Quality Tool: toml-based backend templating

This demonstrates a version of the suggestion in https://github.com/pantsbuild/pants/issues/17729#issuecomment-1843094371

The PR does two things:

introduce a mechanism for backend templating
use that mechanism to provide a simple way to configure code_quality_tool backends. Also adds docs and validation for code_quality_tool

Overview/Demo

It introduces the notion within bootstrapping that backends can be templated out from a package:

[GLOBAL]
backend_packages.add = [
    "pants.backend.python",
    "flake8",
    "markdownlint",
]

[GLOBAL.templated_backends.flake8]
template = "pants.backend.experimental.adhoc.code_quality_tool_backend_template"
goal = "lint"
target = "//:flake8_linter"
name = "Flake8"

[GLOBAL.templated_backends.markdownlint]
template = "pants.backend.experimental.adhoc.code_quality_tool_backend_template"
goal = "lint"
target = "//:markdownlint_linter"
name = "Markdown Lint"

Working example

See https://github.com/gauthamnair/example-pants-byo-tool/tree/global-backends-option-demo for a repo with a working example of using this functionality to activate code quality tool without needing an in-repo plugin.

Backend templating

We adjust the meaning of backend_packages to include either:

a str specifying a python package to be loaded as before, containing a register.py with entrypoints that take no arguments
OR a str referring to a templated backend defined as a key in [GLOBAL.templated_backends]. See below how they get loaded.

Backends are loaded in order, as before.

A templated backend is configured with toml following this shape:

[GLOBAL.templated_backends.<backend_package_name>]
template = "path.to.backend.generator.module"
kwarg1 = "val1"
kwarg2 = "val2"

To load it, pants will import the python module specified in the template arg. It will call the generate function on it to produce a register-module-like object:

# pseudocode equivalent to the load_backend impl in this PR
from path.to.backend.generator.module import generate

obj = generate(<backend_package_name>, dict(kwarg1="val1", kwarg2="val2"))

Then pants will look for and call the usual backend registration entry points but on obj: obj.rules(), obj.target_types(), etc.

Code Quality Tool

PR makes these changes:

Code Quality Tool backends are now available via the backend template pants.backend.experimental.adhoc.code_quality_tool_backend_template. The backends it stamps out are self-contained and does not require the adhoc backend to be separately added to backend_packages
Added several bits of documentation, argument validation and better error messages for code quality tool users.

Dec 07 '23 13:12 gauthamnair

@kaos I am going to try to begin adding tests here.

Dec 13 '23 20:12 gauthamnair

@kaos , I've become uncomfortable with a part of the proposed API for 'templated backends'. Config args are currently represented as plain extra args to the entrypoints in register.py. So looks like:

def rules(backend_package_alias: str, goal: str, target: str, name: str):
    cfg = CodeQualityToolRuleBuilder(
        goal=goal,
        target=target,
        name=name,
        scope=backend_package_alias,
    )
    return cfg.rules()

This requires using things like inspect as you suggest to figure out what the arguments are supposed to be in case the user does not supply them. If there was a type-error we would have to dig even deeper to provide a good error message.

It seems it would be better if the package advertised the config args needed to template it as a structure. For example:

config = CodeQualityToolBackendConfig

@dataclass
class CodeQualityToolBackendConfig:
    goal: str
    target: str
    name: str

def rules(backend_package_alias: str, config: CodeQualityToolBackendConfig):
    ...

In addition to the optional attributes like rules, target_types, etc. a backend used for templating would have an extra required attribute, above proposed to be config. The extension loader/bootstrapper could inspect if the kwargs supplied in pants.toml are fine and then construct the config using the type by some standard procedure. It should also be able to provide helpful error messages.

As to how to express the types, I was trying to make something like:

class CodeQualityToolBackendConfig:
    goal = StrOption(
        help=f"The pants goal implemented by this backend. Must be one of {supported_goal_names}",
        default=None
    )
    target = TargetOption(
        help=f"The address of the {CodeQualityToolTarget.alias} target",
        default=None
    )
    name = StrOption(
        help="Name of this backend to show in pants help",
        default=None
    )

work because we already have all this language around expressing configuration and help for it, but using these kinds of Option seem strongly tied into the Subsystem and general pants configuration workflow. I can't find a simple function to take a dictionary and validate it through such a class. Further the Option system is not designed to express things like 'required' arguments so well, because backends and subsystems should come with meaningful defaults.

Do you think it is worth worrying about this kind of thing at this stage? I am perhaps unreasonably worried that someone aside from code_quality_tool would use the backend templating mechanism even without it being widely advertised. Plus it is such a niche use-case.

Maybe a simpler, less declarative API like:

def parse_templating_args(kwargs: dict) -> T:
    # parses the kwargs from toml and templating backend author is at liberty to provide a helpful error message here

def rules(backend_package_alias: str, args: T) -> Iterable[Rule]:
    ...

Will fit well for now. I think I'll roll with something like that.

Dec 14 '23 15:12 gauthamnair

Actually, scratch that, been thinking about this in an unsound way. It is problematic to have separate methods rules, target_types etc that separately accept the configuration values. It is possible for the rules and targets from a templated backend to have common generated dependencies. Like if we generate a Target class definition a rule might refer to it somewhere, or at least to the involved fields.

Here is an API that would fix this fundamental oversight as well as encapsulate the parsing/validation of the kwargs:

Instead of supplying a package when templating, we supply the path to a module
The module must define a top-level function generate_backend with this signature:

def generate_backend(backend_package_alias: str, kwargs: dict) -> RegisterProtocol:
    # parses and validates kwargs
    # assembles an object that will build the backend
    ...

The returned object must satisfy the same protocol as register.py does with regular backends. It can have attributes like rules, target_types, etc. which must be zero-arg callables.

Will move the PR in this direction.

Dec 14 '23 17:12 gauthamnair

@kaos , the way the PR is currently one does have to refer to the configured backends still in backend_packages.

I think it will be fine to change the template key for module so that the example will look like:

[GLOBAL]
backend_packages.add = [
    "pants.backend.python",
    # the below are configured/templated backends and must still be referred to in the backend_packages list.
    "flake8",
    "markdownlint",
]

[GLOBAL.templated_backends.flake8]
module = "pants.backend.experimental.adhoc.code_quality_tool_backend_template"
# not template =  and not package = because the string refers to a module not a python package.
goal = "lint"
target = "//:flake8_linter"
name = "Flake8"

[GLOBAL.templated_backends.markdownlint]
module = "pants.backend.experimental.adhoc.code_quality_tool_backend_template"
goal = "lint"
target = "//:markdownlint_linter"
name = "Markdown Lint"

What do you think? I can make the s/template/module/ change.

Dec 16 '23 02:12 gauthamnair

@kaos , the way the PR is currently one does have to refer to the configured backends still in backend_packages.

Yep, and I like that. 👍🏽

What do you think? I can make the s/template/module/ change.

Yea, I like module =.

[GLOBAL.templated_backends.markdownlint]

Do you have any alternative suggestions for the templated_backends option?

I like the generate() method to return the backend "module" used to get all the rules, target types etc from. But I think we could still have the generate method to be declared in a register.py module (so it would still be closer in structure/semantics to a regular backend).

So:

[GLOBAL.templated_backends.markdownlint]
module = "pants.backend.experimental.adhoc.code_quality_tool"
...

When loading backends, pants would load pants.backend.experimental.adhoc.code_quality_tool.register and call generate("markdownlint", ...) on that to get the module used to load all the rules etc from. Wdyt?

Dec 16 '23 02:12 kaos

I'll dive into this this week, but damn. Looks absolutely fantastic on the surface.

You should feel proud, this is very well done.

Dec 17 '23 15:12 thejcannon

@kaos This comment takes your suggestion and see how to tie it together in a self-consistent way of talking/thinking about backends. What you are proposing is an interesting formulation.

Current proposal: two types of backends

What is in this PR at the time of this writing. There are two types of backends: regular and templated. We refer to their definition in code differently. One has a package with a register.py, the other points to a module with a generate inside of it.

Alternate proposal: unification of templated and regular backends

This is following your suggestion. We can propose that all backends are "generated"/"templated". The only difference with regular backends is that generate is implicit and trivial:

The entry point for installing any backend functionality is a python package with a register.py.
We consider all backends are "generated":
- If register.py defines generate we use that callable with generate(backend_alias, kwargs)
- If not, we consider an implicit generate that returns the loaded register.py module object itself and asserts that no kwargs were present in the configuration:

# equivalent to the implicit generate defined in every classic backend package
def generate(backend_package_alias, kwargs) -> RegisterProtocol:
    assert not kwargs, f"Backend package {__name__} does not take configuration args"
    return sys.modules[__name__]

In this view the templated_backends config section is something that applies to all backend instantiations. I like the unification that this would make happen, now that it seems the pieces can be tied together. And we'd use package in the toml.

I'd like to wait for another reviewer/maintainer to chime in as well and if they are +1 with moving to this direction, I'll go ahead and impl.

Dec 18 '23 11:12 gauthamnair

We can propose that all backends are "generated"/"templated"

Can you share a snippet of what that would look like in the toml?

Dec 18 '23 13:12 thejcannon

Sorry, @thejcannon , what I mean by:

We can propose that all backends are "generated"/"templated"

Is just a conceptual change, but does not require any change in toml. Its just saying that internally we can manage regular backends as though they too are "generated" via an implicit generate that takes only empty kwargs.

Dec 18 '23 14:12 gauthamnair

We adjust the meaning of backend_packages to include either:

a str specifying a python package to be loaded as before, containing a register.py with entrypoints that take no arguments

OR a str referring to a templated backend defined as a key in [GLOBAL.templated_backends]. See below how they get loaded.

[GLOBAL]
backend_packages.add = [
    "pants.backend.python",
    # the below are configured/templated backends and must still be referred to in the backend_packages list.
    "flake8",
    "markdownlint",
]

[GLOBAL.templated_backends.flake8]
module = "pants.backend.experimental.adhoc.code_quality_tool_backend_template"
# not template =  and not package = because the string refers to a module not a python package.
goal = "lint"
target = "//:flake8_linter"
name = "Flake8"

...

This looks awesome, but I've got a conceptual worry that nags at me. These are definitely not packages in any sense that overlaps with existing usage. I'm not even sure if they'd count as backends? Conceptually, they're jazzed up code generators. I don't say that to denigrate the work done - it's awesome, and Pants is unforgivingly verbose so I'm all for simplifying the process. But there's certainly a lack of smarts that separates them from regular backends. I'm worrying that by equating them we'll (a) make teaching Pants harder, (b) have a mismatch in expectation of what they'll do, and (c) potentially set us up for a lot of work to meet those expectations?

I'm wondering if there's a "missing" component in Pants here which lives between static configuration (pants.toml) and the BUILD files. For example; in Bazel you've got WORKSPACE which looks like build files, but runs earlier and is where you'd regularly register rules. bazel also has .bazelrc, and contains settings modulated by contexts. I've not kept up enough with the previous work in this series to know why we are where we are now, so maybe this has been discussed before.

As a more general point about this whole series, I'm worried how this would interact e.g. with code generators, or native Python extensions. Can that work? How does it work? What about Rust, which requires both rustup + toolchain before it can run, and the partioning has to happen on "package" or workspace level? On the other hand -- the way Pants handles source files + dependencies is quite inflexible, so this is a good use-case for finding those limits. I'm just wondering whether we can then translate that back into gains for general "non-trivial" file based runs.

Dec 18 '23 15:12 tgolsson

I'm wondering if there's a "missing" component in Pants here which lives between static configuration (pants.toml) and the BUILD files. For example; in Bazel you've got WORKSPACE which looks like build files, but runs earlier and is where you'd regularly register rules.

(Just a drive-by thought) The "Environments" support parses the root BUILD file without macro support to find environments. It could be that if we had something akin to WORKSPACE, the environments handling would move there as well.

Dec 18 '23 17:12 thejcannon

I'm wondering if there's a "missing" component in Pants here which lives between static configuration (pants.toml) and the BUILD files. For example; in Bazel you've got WORKSPACE which looks like build files, but runs earlier and is where you'd regularly register rules.

(Just a drive-by thought) The "Environments" support parses the root BUILD file without macro support to find environments. It could be that if we had something akin to WORKSPACE, the environments handling would move there as well.

Not just the root BUILD file.. it parses any BUILD file as required to load all defined environment targets.

Dec 18 '23 20:12 kaos

We can propose that all backends are "generated"/"templated".

Yes, we can attempt to always check for generate(..) when loading a backend, but perhaps only if we have kwargs. This would allow a backend to be usable either as a regular backend or as a templated backend.

These [I assume this refers to the aliases in the backend_packages list] are definitely not packages in any sense that overlaps with existing usage. I'm not even sure if they'd count as backends? Conceptually, they're jazzed up code generators.

I agree they're not packages. What if the global option was builtin_plugins rather than backend_packages? I think that is a better description, as it is a bit unfortunate that the current method of enabling the backends (plugins) are in an option tied to the implementation rather than the effect.

However, not being a package but rather a placeholder for a configured name that in turn points at a module to use to generate the backend to load--they indeed can be seen as jazzed up code generators, but not generic and always produce the same thing; the backend to load. So the new concept being that mixed with the regular backend packages there can now also be "backend aliases". As these aliases are named freely, it'll be up to the user to pick names that makes sense to them so I don't see this as a big problem, outweighed by the intuitive effect of being able to manipulate one single list for all "backends" to use.

Dec 18 '23 21:12 kaos

But there's certainly a lack of smarts that separates them from regular backends. I'm worrying that by equating them we'll (a) make teaching Pants harder, (b) have a mismatch in expectation of what they'll do, and (c) potentially set us up for a lot of work to meet those expectations?

What smarts are lacking?

A templated backend can do precisely the same thing as a regular backend, so I fail to see what expectations there may be that we wouldn't meet?

Dec 18 '23 21:12 kaos

oh, and thanks for reviewing and raising questions, it's precisely what we need to settle this :) 🙏🏽

Dec 18 '23 21:12 kaos

Another shift, to stay closer to packages in backend_packages could be to instead have:

[GLOBAL]
backend_packages.add = [
    "pants.backend.python",
    # the below are configured/templated backends and must still be referred to in the backend_packages list.
    "pants.backend.experimental.adhoc.code_quality_tool",
]

[[GLOBAL.templated_backends]]
package = "pants.backend.experimental.adhoc.code_quality_tool"
goal = "lint"
target = "//:flake8_linter"
name = "Flake8"

[[GLOBAL.templated_backends]]
package = "pants.backend.experimental.adhoc.code_quality_tool"
goal = "lint"
target = "//:markdownlint_linter"
name = "Markdown Lint"

So, when loading backends, we check if there are any configurations for the package in templated_backends and if so load one instance for each configuration. In case there are no configurations we load the backend as usual.

The main difference to previous is that we don't use the names in the backend_packages and thus all variants sharing a package are enabled/disabled as a single unit, using the package option as the connecting value to the backend package.

I kind of like this idea, as it keeps the backend_packages list clean.. and if a backend package is loaded without configuration that is meant to have that we it can raise an error explaining the situation (or if Pants detects this out of the box...)

Dec 18 '23 21:12 kaos

These are definitely not packages in any sense that overlaps with existing usage. I'm not even sure if they'd count as backends? Conceptually, they're jazzed up code generators.

Very much +1 that we should be careful to have coherent meanings for the words backend, backend_package or else things can be a headache and a confusion down the road. In the current version of the PR, with the backend aliases in the backend_packages list, I think they qualify as backends, if we define a backend to be:

backend: a set of Targets, Rules, build aliases, etc that work together to power some bit of pants end-user functionality.

I agree that they are code generators. In this case a kind of code generator for a pants plugin, creating a bunch of types and rules relating them to goals and known union bases.

As to:

I'm worried how this would interact e.g. with code generators, or native Python extensions.

The only changed introduced here or, for that matter, in the original PR - is a pattern of dynamically generating pants types and rules during the process of backend-loading.

Andreas, the issue with:

[GLOBAL]
backend_packages.add = [
    "pants.backend.python",
    # the below are configured/templated backends and must still be referred to in the backend_packages list.
    "pants.backend.experimental.adhoc.code_quality_tool",
]

is that backend order matters for things like fmt and fix. https://github.com/pantsbuild/pants/issues/17729#issuecomment-1843840932 , something I constantly need to be reminded of.

We can also not do this

I'd very much like us to leave on the table the option of not introducing this new bit of .toml meaning and backend loading. As of https://github.com/pantsbuild/pants/pull/20135 it is already fairly straightforward to enable new code quality tools. Yes, the repo admin won't be able to just do stuff in pants.toml and BUILD files. But this is all that is needed:

pants.toml

[GLOBAL]
pythonpath = ["%(buildroot)s/pants-plugins"]
backend_packages.add = [
    "pants.backend.python",
    # code_quality_tool's target and rules are defined in the adhoc backend
    "pants.backend.experimental.adhoc",
    # thin in-repo plugin activates a particular target as a linter
    "flake8_linter_plugin",
]

and an in-repo plugin pants-plugins/flake8_linter_plugin/register.py

from pants.backend.adhoc.code_quality_tool import CodeQualityToolRuleBuilder

def rules():
    cfg = CodeQualityToolRuleBuilder(
            goal="lint", target="//:flake8_tool", name="Flake8", scope="flake8_tool"
    )
    return cfg.rules()

This is already supported and does not require introducing any new concepts, bits of syntax or overloading of syntax, and the user retains full control over things like ordering.

On the other hand, users do seem to have apprehension to go into writing their own in-repo plugins, and maybe that will also happen for such a microscopic one.

Dec 19 '23 12:12 gauthamnair

backend order matters for things like fmt and fix. #17729 (comment) , something I constantly need to be reminded of.

Ah, thank you for keeping that top of mind. Good point, I thought that would be solved by having the GLOBAL.templated_backends as a list, since that will give as a determined order, but I realize that you still have no control of where the regular backends go if they need to go in between the templated ones for a particular templated backend.

I still think this issue with order being significant is a dumb issue to have and ought to be resolved.. but until then we'll have to deal with it.

Dec 19 '23 14:12 kaos

But there's certainly a lack of smarts that separates them from regular backends. I'm worrying that by equating them we'll (a) make teaching Pants harder, (b) have a mismatch in expectation of what they'll do, and (c) potentially set us up for a lot of work to meet those expectations?

What smarts are lacking?

A templated backend can do precisely the same thing as a regular backend, so I fail to see what expectations there may be that we wouldn't meet?

I mentioned a few cases where I think it'd be hard to use this efficiently. By smarts I mean it cannot use any deeper knowledge about the language to run correctly -- it only runs on a glob of files (and whatever smarts are used to generate that glob). And then there's partitioning better to improve caching, and so on. That's fine on efficient tools, but this would suck horrendously for something like cargo clippy because you'd recompile everything from scratch each time you ran. And I'm not raising this point as an argument this mechanism in general, more so about the semantic mixup.

And while it's true it can do the same things, especially with more configuration options, at some point the configuration to get exactly what's needed (again: see clippy) would likely be less maintainable, definitely less shareable, and still very verbose.

Very much +1 that we should be careful to have coherent meanings for the words backend, backend_package or else things can be a headache and a confusion down the road. In the current version of the PR, with the backend aliases in the backend_packages list, I think they qualify as backends, if we define a backend to be:
backend: a set of Targets, Rules, build aliases, etc that work together to power some bit of pants end-user functionality.
I agree that they are code generators. In this case a kind of code generator for a pants plugin, creating a bunch of types and rules relating them to goals and known union bases.

Sure. I think my concern is mostly that I put a lot of emphasis on the "smarts" side, where this makes tradeoffs that makes it unfit for a bunch of use-cases. I'll concede that maybe that isn't part of the definition, semantically. I do however also maintain that there's a semantic difference between a dedicated plugin with multiple backends compared to this. That's maybe less important the package-vs-non-package distinction.

As to:

I'm worried how this would interact e.g. with code generators, or native Python extensions.

The only changed introduced here or, for that matter, in https://github.com/pantsbuild/pants/pull/20135 - is a pattern of dynamically generating pants types and rules during the process of backend-loading.

I'm aware, it's just my general line of reasoning around whether these count as "backends", as in -- are there footguns here that warrants keeping them more or less separate from current backends.

is that backend order matters for things like fmt and fix. https://github.com/pantsbuild/pants/issues/17729#issuecomment-1843840932 , something I constantly need to be reminded of.

I much prefer the separate-but-equal approach. How big is the risk here w.r.t. ordering? Is it something we can solve in different ways? It seems like a footgun in general -- though I guess with the split approach it's unsolvable without other mechanics.

I'd very much like us to leave on the table the option of not introducing this new bit of .toml meaning and backend loading. As of https://github.com/pantsbuild/pants/pull/20135 it is already fairly straightforward to enable new code quality tools. Yes, the repo admin won't be able to just do stuff in pants.toml and BUILD files. But this is all that is needed:

If I had to rank my ideal workflows it'd probably be "WORKSPACE-style"/BUILD, followed by pants.toml, followed by this dummy backend model. That to me seems like sacrificing both shareability[^1] and simplicity.

[^1]: This is probably me being me here, as technically these could be shared like any other plugin. It just seems weird someone would go through the process of publishing a plugin with 5 lines of code. Plus that sort of defeats the point of a simple system, if the easiest way to use it is let someone else do it....

Dec 19 '23 16:12 tgolsson

Hi all, I'm wondering what the status of this is?

Feb 16 '24 04:02 benjyw

@benjyw The PR is itself in a good state, but there is debate about whether this is the right move or whether what is offered here will provide value greater than its weight. I can pick it back up if the community can see a path forward. It was otherwise, in my opinion, starting to fall in the "when in doubt, leave it out" category.

Feb 18 '24 12:02 gauthamnair