[RFC] Grafana Agent flow mode plugins

rfratto opened this issue 11 months ago • 10 comments

The most recent copy of this proposal can be found on Google docs. The below is the original version of this proposal for posterity.

NOTE: This proposal is likely being written 8-12 months too soon. However, plugins are likely to be the last foundational piece of the future of Grafana Agent (following flow mode and modules). As the last foundational piece, it is important to get alignment on future plans early so we do not take any actions that go against the long-term goals.

If you are reading this and are excited for plugins, do not expect rapid movement or delivery any time soon. The best case scenario is that plugins are made generally available in a 2025 release.

Background

Currently, new capabilities can only be added to Grafana Agent Flow by contributing a new component to the official Git repository (https://github.com/grafana/agent).

Having a centralized repository of components makes it harder for an open source project to thrive:

  • Grafana Agent maintainers may disagree with the value or implementation of a new component, preventing others from using it.
  • Accepting a new component implies that it is "official" and receives the same amount of support as all other components.
  • All components must use the same Apache 2.0 license, preventing components that cannot be distributed under it, such as a component to launch a Loki server.
  • The project ends up with a massive amount of dependencies, causing dependency hell issues.
  • All components are bound to the same release cycle, even if newer components would benefit from more frequent releases.

Other projects, such as the OpenTelemetry Collector, solve this problem by having different distributions of the collector. While distributions solve the issues above, this approach also fragments the community, as different distributions may have different subsets of components, making migration between them difficult.

I propose that we support a plugin system for Grafana Agent flow mode, allowing sets of components to be provided by external plugins which can be loaded at runtime into the Grafana Agent process.

This proposal serves as a high-level proposal of plugins to achieve maintainer and community consensus on the long-term goals.

Goals

  • Get consensus on the need for plugins.
  • Establish baseline requirements for plugins.
  • Establish proposals to follow this proposal.
  • Establish phases for how plugins will be introduced.

Non-goals

  • Establish any technical information about plugins.

Proposal

Flow mode should introduce the concept of a "plugin," where a plugin is some loadable code that provides one or more components that can be defined in a Flow configuration.

The mechanisms through which plugins are created, retrieved, and defined are not in scope for this proposal:

  • Some potential creation mechanisms may be: a binary, a WASM module, or a Lua script.
  • Some potential retrieval mechanisms may be: GitHub releases or a centralized registry like NPM.
  • Some potential definition mechanisms may be: a configuration block in a Flow file or flags passed to Grafana Agent.

Requirements

These are high-level requirements that plugins must achieve:

  • Performant: components from plugins must have minimal performance overhead compared to native components.
  • OS independent: plugins must be available to use on all supported operating systems (Linux, FreeBSD, macOS, Windows).
  • Ability to migrate existing components: existing components must be able to work as plugins (with changes to their implementation if necessary).

If it is not possible for us to create a plugin system which meets these requirements, we should consider abandoning the plugin model.

Performance

The biggest concern with plugins is performance of component communication. Today, communication between two components (such as prometheus.scrape sending metrics to prometheus.remote_write) is largely achieved using shared memory, as it's internally represented by a native function call.

However, plugin communication is unlikely to be able to use shared memory; the only mechanism where shared memory is available is Go's standard plugin package, which doesn't support Windows, making it OS-dependent. All other potential communication mechanisms will involve some kind of message marshaling and unmarshaling between running plugins and the plugin host (Grafana Agent).

Before full development on plugins begins, a proof of concept is needed that demonstrates the overhead of message marshaling and unmarshaling to prove the viability of plugins.

Sub-proposals

For plugins to be fully realized, we need at least these five proposals to build on top of this one:

  • Plugin component communication: explore the mechanism through which messages are sent between components across plugins for tasks such as sending telemetry data, and whether it's possible for plugins to be performant.
  • Plugin communication protocol: explore the mechanism through which Grafana Agent will communicate with plugins, allowing components to be created from a plugin.
  • Exposing services to plugins: explore the mechanism through which plugins will be allowed to make use of services defined in flow mode, such as the clustering and http service.
  • Plugin retrieval and versioning: explore the mechanism through which Grafana Agent will download plugins, and how plugin versioning is considered.
  • Definition of plugins and plugin components: explore the mechanism through which users of Grafana Agent will define a dependency on a plugin, and how that user will be able to create components which are provided by a plugin.

These proposals will likely be written by different people over a long period of time. The first proposal, plugin component communication, is a prerequisite for all the other proposals, as it will prove or disprove whether plugins can be performant.

Delivery plan

Assuming plugins are viable, they will be delivered in four phases:

  • Experimental behind feature flag: At this stage, plugins will be considered highly experimental, and will require a feature flag to enable.
  • Beta without feature flag: At this stage, all Grafana Agent flow mode users will be able to write and use plugins without a feature flag, signaling that plugins will be made stable and are here to stay.
  • Stable without feature flag: At this stage, Grafana Agent plugins are considered to be stable and production-ready.
  • Move existing components to plugins: Finally, existing components should be evaluated for whether it makes sense to make them plugins instead. Since this would be a breaking change, it should be included with a major version release (such as a v2.0).

rfratto avatar Sep 15 '23 01:09 rfratto

Plugin component communication: explore the mechanism through which messages are sent between components across plugins for tasks such as sending telemetry data, and whether it's possible for plugins to be performant.

Plugins have been on my mind a lot, especially whether they're even viable.

I would like to personally write this next proposal, but I would like to see others take the other 4 (though I'll still want to be involved to some extent).

rfratto avatar Sep 15 '23 01:09 rfratto

A few thoughts and concerns:

Development Experience

One of the key selling points of the flow philosophy is that a component is a single go package that self-registers and is a relatively standalone, testable piece of code. I wouldn't want plugins to introduce a different experience. If a developer has to choose whether they are making a plugin or a compiled component early in the process, that would be undesirable. Ideally, you should be able to import the exact same go package without modifications and run it as a plugin the same as if it were compiled in for maximum portability and flexibility. That may be unrealistic, but I think it should be a goal. I would hate to fracture a very young open-source community into vastly different runtime modes (which is why lua would be a very hard sell for me).

Living without plugins

The status quo is that all of the current components are compiled into the agent binary. The self-registration mechanism makes that really nice because you can import with _ and be done with it. This proposal lists a few "political" reasons components would not be included in the main repo, but none of those preclude somebody from compiling the agent themselves with a custom imports file.

Since plugins will almost certainly have a performance cost, I'd argue they need to have a significant ease of use benefit over the current paradigm to be worth it. I remain skeptical any of the currently available solutions for go will do that, but I'd love to be proven wrong.

We could alternately dedicate time to normalizing and facilitating the creation of custom agent binaries with arbitrary combinations of component packages. I made a proof of concept for personal uses, and it has some rough edges, but was not intolerable. With some docs and maybe some tooling, we could make it pretty easy for somebody to create an agent with (or without) whatever components they want.

captncraig avatar Sep 15 '23 19:09 captncraig

Ideally, you should be able to import the exact same go package without modifications and run it as a plugin the same as if it were compiled in for maximum portability and flexibility. That may be unrealistic, but I think it should be a goal.

This is a goal I share, but it's not discussed in this design doc since I don't go over the API at all. It should be possible, but it may require new restrictions on a component's API.

In particular, if plugins are built using WASM or system binaries, exporting interfaces introduces a new challenge. The plugin engine would need to be able to provide some value for an interface across plugins, but Go doesn't allow interfaces to be built at runtime. A workaround for this is to introduce some kind of code generation to build interface implementations, but I don't know if that's something we'd want to do since it complicates the build process.

It is, however, possible to build functions at runtime. If we were to restrict our existing APIs such that you could only export structs of functions, then plugins would be able to work as native components do today, and both native components and plugins would be built exactly the same.

Since plugins will almost certainly have a performance cost, I'd argue they need to have a significant ease of use benefit over the current paradigm to be worth it. I remain skeptical any of the currently available solutions for go will do that, but I'd love to be proven wrong.

Developing a component in a plugin will not be easier than the current paradigm, but it won't be harder either.

However, plugins solve important problems that we've been facing:

  • Dependency hell: our list of >100 components means we're frequently in dependency hell. Breaking up our existing components into a smaller set of plugins would dramatically help with this.
  • Binary size: our binary is huge, which is a problem for people who don't use half of the components we offer.
  • Release synchronization: all components must be released at the same time; a plugin model means a subset of components could be released at a different cadence.

We could alternately dedicate time to normalizing and facilitating the creation of custom agent binaries with arbitrary combinations of component packages

This is exactly what the RFC is arguing that we shouldn't do:

Other projects, such as the OpenTelemetry Collector, solve this problem by having different distributions of the collector. While distributions solve the issues above, this approach also fragments the community, as different distributions may have different subsets of components, making migration between them difficult.

I propose that we support a plugin system for Grafana Agent flow mode, allowing sets of components to be provided by external plugins which can be loaded at runtime into the Grafana Agent process.

We see this pattern with OpenTelemetry collector, and I'm overall not a fan of the distribution-type model for the reason above. The scenario of "this distribution doesn't have a component I want, so I have to fork it or beg the maintainers to add it in" can be seen as user-hostile.

Do you have counterarguments for why adopting a distribution model is better than a plugin model?

rfratto avatar Sep 15 '23 20:09 rfratto

I'm coming at this from a slightly different angle, having recently switched to Grafana Agent after using either vector.dev or one of the various distros of the OTel collector for the past couple of years.

The first thing I think is important to note is that having a single binary with all the "plugins" installed into it (whether they are the plugins being proposed, existing components, or a combination of both) is actually really handy for most users: it means they don't have to worry about whether they've compiled the correct code and can just deploy a single binary/container. This is not, however, advocacy for keeping the status quo, and the point about OTel distros is very valid!

I have always appreciated the DataDog approach to plugins, which boils down to "You want to install it easily? You contribute upstream. You want something specific to you? Drop the code into this directory and we'll pick it up, but we won't support it". This approach gives the flexibility of custom plugins whilst maintaining plugin quality in the "core" repo.

The idea that a user could develop a plugin locally, run it on their own platform, and then contribute to "core" if they wanted to is a nice pattern, and it even allows folks to release plugins under their own github/NPM/whatever repo and have Grafana Agent "pick it up" from a directory on the filesystem if they want to use another license.

It does, however, mean that the agent needs to be able to load from disk at launch, and potentially be able to "reload" everything from disk whilst running, depending on how advanced we want to make it.

Not sure if that makes sense, so ask any questions and I'll do my best to clarify! :laughing:

proffalken avatar Sep 16 '23 15:09 proffalken

I'll give my 2 cents here as this kinda hit a soft spot for me recently.

One could argue that the approach OpenTelemetry took with the official and contrib distros of their collector allowed vendors to provide support for their own specific platforms, making the collector as agnostic as can be. In reality, though, users' freedom was in some ways greatly reduced.

Pros

Allowing plugins and extensions to be easily added opened the community up to both individual OSS developers and vendors extending the functionality. It preserved the ability to keep "core" support on the official collector without being "cornered" into guaranteeing support for components not developed and approved by the core community.

Cons

Allowing vendors to develop plugins which provide custom support for their platforms lets them implement logic and requirements which diverge from the OTel spec. This causes issues, as the vendor now requires users to implement custom logic in their systems, which essentially creates a sort of vendor lock-in.

My example is simple: a vendor I am using requires OTel signals to be exported with 4 specific headers which indicate the classification of the signals sent and are used to index them. These headers cannot be given dynamic values based on the signal being sent, so I'm left having to use their provided distro of the OTel collector contrib, which knows how to deduce the headers from the signal passing through the exporter. This makes it harder to adopt different agents, as not all vendors are happy to develop support for all shipping agents.

Keeping things as close to the standard as possible is always a good idea from the user's perspective, as it preserves the freedom to choose the solution which best fits their own specific needs.

Between this RFC and making the River language as complete as possible, I would choose the latter any time.

With that said, I completely agree with @proffalken about always keeping things as a single binary. If I need to compile the agent on my CI, it makes life much harder, as I need to keep track of changes in the agent's build process instead of simply downloading a released version and running it. In that sense, creating a contrib distro of the agent and providing an easy interface for devs to add components while keeping up with upstream is the best way to go for these types of things, IMHO.

oferziss-armis avatar Sep 19 '23 09:09 oferziss-armis

I'm excited to see how this plays out; plugins sound like an exciting approach to a more modular Agent in the future! 👀

One thing that sticks out is the "Ability to migrate existing components" requirement, and whether it should be a hard one. I feel the added value of plugins might be enough even if we cannot migrate all current components. For example, if the performance overhead was a bit too much for, say, prometheus.scrape to be usable as a plugin, that shouldn't prevent us from using other components as plugins.

Are you worried that, of the two resulting classes of components, the ones that come with a performance-related warning would feel like second-class citizens?

tpaschalis avatar Sep 19 '23 12:09 tpaschalis

Addressing some of the comments here:

This proposal was very high level, so it probably didn't do a great job at helping envision what plugins could potentially be.

I could imagine adding something like this to a Flow config:

// Imports components from the plugin in the "otelcol" namespace.
plugin "otelcol" {
  url     = "github.com/grafana/flow-plugin-opentelemetry-collector"
  version = "1.0.1" 
}

// Imports components from the plugin in the "older_otelcol" namespace.
plugin "older_otelcol" {
  url     = "github.com/grafana/flow-plugin-opentelemetry-collector"
  version = "0.8.5" 
}

otelcol.receiver.otlp "default" { 
  http {}
  grpc {}

  output {
    traces = [older_otelcol.exporter.otlp.default] 
  }
}

older_otelcol.exporter.otlp "default" { ... }

This hypothetical design has a few interesting attributes:

  • Users do not have to compile the agent or the plugins; they just specify which plugins they want the agent to use at runtime.
  • It allows using different versions of the same plugin, for example, if a newer version has a bug, or if you want to try an experimental component from an unreleased version.

This is just a sketch, and I'm not sure what the final proposal would look like, but I do not want plugins to require people to recompile the agent.

cc @proffalken

Allowing vendors to develop plugins which provide custom support for their platforms lets them implement logic and requirements which diverge from the OTel spec.

IMO, this is a good thing. Locking in components to only do OpenTelemetry will cause progress to be bottlenecked by when OpenTelemetry adopts a change. By necessity of attempting to be a global standard for all of telemetry data, OpenTelemetry will be slower to adopt new additions, as it needs to be careful.

For example, the pyroscope.* components in Flow do not use OpenTelemetry, since OpenTelemetry is still in the process of adopting a spec for profiles.

I don't want Flow to be limited to only OpenTelemetry components, and we don't even do that today; we have multiple sets of components from different ecosystems (prometheus.*, loki.*, pyroscope.*, otelcol.*, discovery.*).

I also don't want to limit plugins to only dealing with telemetry data. If someone wants to write a plugin with a component that provisions architecture, they should be free to do so.

In that sense, creating a contrib distro of the agent and providing an easy interface for devs to add components while keeping up with upstream is the best way to go for these types of things, IMHO.

Unfortunately, the -contrib approach really doesn't fix the problems the maintainers are facing today as I mentioned earlier, specifically the one around dependency hell. If someone wants ~all the components, they will struggle to keep their distribution up to date.

Plugins will be a challenge to implement, but I think it will give users much more flexibility around what components are used, and prevent community fragmentation as there will only be one official binary of Flow, with many different plugins for different components to use.

cc @oferziss-armis

Are you worried that, of the two resulting classes of components, the ones that come with a performance-related warning would feel like second-class citizens?

Yes, and I also don't want to play favorites :) It would feel weird to me personally if we said "prometheus, otelcol, loki, pyroscope all get to stay in core for performance but everything else must be a plugin," especially since two of those are Grafana Labs products. We can make Flow an open platform, but it means playing on the same field as everyone else.

I would prefer us to measure the overall impact of plugins and try our best to make the overhead as small as possible so we can become that open platform.

cc @tpaschalis

rfratto avatar Sep 19 '23 13:09 rfratto

I think we should firstly decide on what the user experience should be. For example:

  • What is the syntax for loading plugins?
  • What is the syntax for using them?
  • Are there any security concerns? Should we have restrictions on which plugins can be run?

The other proposals should be based on that user experience goal. However, I am not sure if this should fall within this proposal or a sub-proposal.

ptodev avatar Oct 16 '23 16:10 ptodev

At this point, we're not sure what the technical limitations of plugins are. That will drive what we're going to be able to deliver, which may change what we end up exposing to end users.

While I'd normally agree to start from the user experience, I think this is a problem where some (but not all) technical information needs to be figured out first.

rfratto avatar Oct 16 '23 19:10 rfratto

Adding my 2 cents since I'm interested. Would love to see something where we could easily import or use telegraf plugins, since there are so many... Either linked as a go plugin, or referenced in code? Not sure...

srclosson avatar Apr 23 '24 16:04 srclosson