vector Add `kubernetes_metadata` transform

Motivation

We already have a kubernetes_logs source that collects the Pod logs in the Kubernetes environment, and it covers all of the common use cases.

However, the Kubernetes ecosystem is huge, and advanced users often have a lot of uncommon use cases. We can't possibly provide first-class support for all of them, but we can empower users with the right tools to tailor Vector for their unique needs.

The main concern users have in the Kubernetes environment in relation to log events is enriching the events with the relevant data from the Kubernetes state - things like the name of the Pod the event originates from. As mentioned above, this is already covered by the kubernetes_logs source - but only for the events that come from that source.

So far, we have recognized a number of cases that we want to support, but don't want to include in the kubernetes_logs source:

Sidecar deployments

Deploying Vector as a sidecar (a secondary container within a Pod). This is usually used when an app doesn't write its logs to stdout and uses files instead. In this operation mode, Vector would typically be used with a file source, and would have to fetch the information about the Pod it runs in (and only that Pod!) from the kube-apiserver and annotate all the events from the file source with the Pod metadata.

Refs:
- https://github.com/timberio/vector/issues/5040

Cluster that uses journald for logs

When using Docker as the Kubernetes container runtime, it is possible to configure Docker to use the journald log driver. With this configuration, logs won't be available as files on disk, and the kubernetes_logs source won't be usable. There are myriad possible non-standard configurations like this, so we don't want to add support for them to the kubernetes_logs source - first of all to keep things simple for users on the standard use case, but also because it is virtually impossible to support all of the configurations, and adding that flexibility there would significantly increase the maintenance effort. In other words - supporting this use case via a transform makes the most sense.

In this operation mode, Vector is deployed on each node (the recommended way is still to do it via the vector-agent Helm chart in this case), and a journald source is used in conjunction with an annotating transform. This way, an outcome similar to using the kubernetes_logs source can be achieved without using that source.

Refs:
- https://github.com/timberio/vector/issues/2199

Requirements

To be able to cover all of the use cases, we have to build a solution that is quite flexible.

To reduce the load, we should use the same state-sync architecture that we use in the kubernetes_logs source; however, it needs to be more user-configurable and thus more generic at the code level.

We want to support:

- various resource types to load from the Kubernetes state (Pods/Namespaces/Services/etc)
- various ways to filter the slice of the Kubernetes state to load (Pods in all namespaces / a single Pod by the specified namespace and pod names / etc - potentially passing an arbitrary resource endpoint to watch)
- various ways to match the events with the synced (loaded) Kubernetes state (e.g. match the Pod record with the event by the pod uid taken from one of the event fields, or by container uid and one of the fields, etc); a promising way to achieve this is to build indexes over the synced state, see journalbeat implementation (TODO: add link)
- the ability to annotate the event with arbitrary fields from the matching resource (most likely through passing the paths within the resource)
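
The matching and annotation requirements above can be illustrated with a minimal sketch. It assumes a heavily simplified model in which both events and synced resources are flat string maps; `Fields`, `StateIndex`, and `annotate` are hypothetical names used only for illustration and are not part of Vector or of any existing implementation.

```rust
use std::collections::HashMap;

/// Flat key/value map standing in for both a log event and a synced
/// Kubernetes resource (an assumption; real resources are nested objects).
type Fields = HashMap<String, String>;

/// Hypothetical index over the synced state, keyed by one resource field.
struct StateIndex {
    /// Resource field used as the lookup key, e.g. "metadata.uid".
    key_field: String,
    by_key: HashMap<String, Fields>,
}

impl StateIndex {
    fn new(key_field: &str) -> Self {
        Self {
            key_field: key_field.to_owned(),
            by_key: HashMap::new(),
        }
    }

    /// Insert or update a resource. Returns false (and stores nothing)
    /// when the resource lacks the key field.
    fn upsert(&mut self, resource: Fields) -> bool {
        let key = resource.get(&self.key_field).cloned();
        match key {
            Some(key) => {
                self.by_key.insert(key, resource);
                true
            }
            None => false,
        }
    }

    /// Remove a resource by key, e.g. when the watch reports a deletion.
    fn remove(&mut self, key: &str) -> Option<Fields> {
        self.by_key.remove(key)
    }

    fn get(&self, key: &str) -> Option<&Fields> {
        self.by_key.get(key)
    }
}

/// Look up the resource matching the value of `event_key_field` and copy the
/// configured fields into the event. Each mapping is a pair of
/// (path within the resource, event field to write).
/// Returns false and leaves the event untouched when nothing matches.
fn annotate(
    event: &mut Fields,
    index: &StateIndex,
    event_key_field: &str,
    mappings: &[(&str, &str)],
) -> bool {
    let resource = match event.get(event_key_field).and_then(|k| index.get(k)) {
        Some(resource) => resource,
        None => return false,
    };
    for (from, to) in mappings {
        if let Some(value) = resource.get(*from) {
            event.insert((*to).to_owned(), value.clone());
        }
    }
    true
}
```

In this sketch, an index built with `StateIndex::new("metadata.uid")` would let an event carrying a `pod_uid` field be annotated via `annotate(&mut event, &index, "pod_uid", &[("metadata.name", "kubernetes.pod_name")])`. Matching on a combination of fields would need a composite key, which is left out here.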

What would be great to support eventually:

the ability to build state-sync hierarchies, where one state loader (reflector) dynamically reconfigures itself based on the state obtained from another state loader (reflector) - this will be unlocked with #4214

Design considerations

Configurability and defaults

The solution has to be very configurable, aimed at the advanced users, and designed to cover edge use cases, and it means there are very few defaults that we could sanely apply. This is contrary to the kubernetes_logs source, which was designed to work out of the box with minimal configuration and be a solid solution to the one most common use case.

We should still at least try to make the configuration as easy and intuitive as possible.

Use of generics

Due to the nature of the task, we'll likely have to build most of the code around generic primitives like k8s_openapi::Resource and serde::de::DeserializeOwned (not sure if the names are precise but you get the idea), rather than using concrete types like k8s_openapi::[...]::Pod.
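
As a rough illustration of what "building around generic primitives" could look like, the sketch below writes a function against a trait bound instead of a concrete type. To stay self-contained it uses a local stand-in trait, `ResourceMeta`, which is hypothetical and only loosely mirrors the kind of metadata (group, version, kind) that a trait like k8s_openapi::Resource exposes; the `Pod` and `Deployment` structs are likewise placeholders, not the real k8s_openapi types.

```rust
/// Hypothetical stand-in trait carrying static metadata about a resource type.
/// Defined locally so the sketch compiles without external crates.
trait ResourceMeta {
    /// API group; empty for the core group.
    const GROUP: &'static str;
    const VERSION: &'static str;
    const KIND: &'static str;
}

/// Placeholder types; not the real k8s_openapi structs.
struct Pod;
struct Deployment;

impl ResourceMeta for Pod {
    const GROUP: &'static str = "";
    const VERSION: &'static str = "v1";
    const KIND: &'static str = "Pod";
}

impl ResourceMeta for Deployment {
    const GROUP: &'static str = "apps";
    const VERSION: &'static str = "v1";
    const KIND: &'static str = "Deployment";
}

/// Works for any resource type: "v1" for the core group, "group/version" otherwise.
fn api_version<R: ResourceMeta>() -> String {
    if R::GROUP.is_empty() {
        R::VERSION.to_owned()
    } else {
        format!("{}/{}", R::GROUP, R::VERSION)
    }
}

/// Generic code never names a concrete resource type, only the trait bound.
fn describe<R: ResourceMeta>() -> String {
    format!("{} {}", api_version::<R>(), R::KIND)
}
```

The same function body serves every resource type that implements the trait, which is the property the transform would rely on.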

QA

To ensure proper quality, we'll have to cover the implementation with both E2E tests and unit tests.

The E2E tests can consist of just two cases as a start:

- A simple case of reading the files via the file source and annotating the events similar to the kubernetes_logs source.
- A test simulating the sidecar deployment, where Vector is configured to generate events and annotate them with the Kubernetes state of its own Pod.

Ideally, we'd want to have more test scenarios, but we can add more as we go.

Proposed implementation plan

- Implement a generic resource (AnyResource) to be able to work with arbitrary Kubernetes resources. It must implement k8s_openapi::Resource and serde::de::DeserializeOwned.
- Implement a generic configurable watch request builder to be able to build arbitrary watch requests to the Kubernetes API as configured by the user.
- Implement a state layer that would allow quick lookups by the user-configurable lookup fields (aka configurable indexer).
- Implement a configurable annotator to fill in arbitrary user-configured fields from an arbitrary resource into the event.
- Implement configurable event dropping, a mechanism to allow users to drop events based on some predicate rules.
- Tie all this together in a transform.
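
The watch request builder step could be sketched roughly as follows. It only builds the request path, relying on the standard Kubernetes API URL layout (`/api/{version}` for the core group, `/apis/{group}/{version}` otherwise, `?watch=true`, and a `fieldSelector` on `metadata.name`). `WatchConfig` and `watch_path` are hypothetical names, and the sketch assumes that namespace and object names need no extra percent-encoding.

```rust
/// Hypothetical runtime description of what to watch, as a user might configure it.
struct WatchConfig {
    /// API group; empty string for the core group.
    group: String,
    version: String,
    /// Plural resource name as used in URLs, e.g. "pods".
    plural: String,
    /// Restrict the watch to one namespace; `None` means all namespaces.
    namespace: Option<String>,
    /// Restrict the watch to a single object by its `metadata.name`.
    name: Option<String>,
}

/// Build the path and query of a watch request against the Kubernetes API.
fn watch_path(cfg: &WatchConfig) -> String {
    let mut path = if cfg.group.is_empty() {
        format!("/api/{}", cfg.version)
    } else {
        format!("/apis/{}/{}", cfg.group, cfg.version)
    };
    if let Some(ns) = &cfg.namespace {
        path.push_str("/namespaces/");
        path.push_str(ns);
    }
    path.push('/');
    path.push_str(&cfg.plural);
    path.push_str("?watch=true");
    if let Some(name) = &cfg.name {
        // The '=' inside the selector expression is percent-encoded as %3D.
        path.push_str("&fieldSelector=metadata.name%3D");
        path.push_str(name);
    }
    path
}
```

For the sidecar use case, a config with `namespace` and `name` both set would restrict the watch to the Pod that Vector itself runs in, while leaving both unset corresponds to watching Pods in all namespaces.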

Open questions

Do we implement our own custom predicate logic for dropping events, or leave it and request that users just use the reduce/remap transform / etc?

Nov 17 '20 15:11 MOZGIII

Closing, see https://github.com/timberio/vector/pull/5317#issuecomment-762377958.

Jan 18 '21 17:01 binarylogic

Reopening this just to track additional use-cases / reports like https://github.com/vectordotdev/vector/issues/20366

Apr 24 '24 15:04 jszwedko
