Add `kubernetes_metadata` transform
Motivation
We already have a kubernetes_logs source that collects the Pod logs in the Kubernetes environment, and it covers all of the common use cases.
However, the Kubernetes ecosystem is huge, and advanced users often have a lot of uncommon use cases. We can't possibly provide first-class support for all of them, but we can empower users with the right tools to tailor Vector for their unique needs.
The main concern users have in the Kubernetes environment in relation to log events is enriching the events with the relevant data from the Kubernetes state - things like the name of the Pod the event is originating from. As mentioned above, this is already covered by the kubernetes_logs source - but only for the events that come from the kubernetes_logs source itself.
So far, we have recognized a number of cases that we want to support, but don't want to include in the kubernetes_logs source:
- Sidecar deployments

  Deploying Vector as a sidecar (a secondary `container` within a `Pod`). This is usually used when an app doesn't write its logs to stdout, and uses files instead. In this operation mode, Vector would typically need to be used with a `file` source, and will have to fetch the information about the `Pod` it runs as (and only that `Pod`!) from the `kube-apiserver` and annotate all the events from the file source with the `Pod` metadata.

  Refs:
  - https://github.com/timberio/vector/issues/5040

- Cluster that uses `journald` for logs

  When using Docker as the Kubernetes container runtime, it is possible to configure Docker to use the `journald` log driver. With this configuration, logs won't be available as files on disk, and the `kubernetes_logs` source won't be usable. There are a myriad of possible non-standard configurations like this, so we don't want to include support for them in the `kubernetes_logs` source - first of all to keep things simple for the users that are on the standard use case, but also because it is virtually impossible to support all of the configurations, and adding that flexibility there would significantly increase the required maintenance effort. In other words - supporting this use case via a transform makes the most sense.

  In this operation mode, Vector is deployed on each node (the recommended way is still to do it via the `vector-agent` Helm chart in this case), and a `journald` source is used in conjunction with an annotating transform. This way, an outcome similar to that of the `kubernetes_logs` source can be achieved without using the `kubernetes_logs` source.

  Refs:
  - https://github.com/timberio/vector/issues/2199
Requirements
To be able to cover all of the use cases, we have to build a solution that is quite flexible.
To reduce the load, we should use the same state-sync architecture that we use in the kubernetes_logs source. However, it needs to be more user-configurable and thus more generic at the code level.
We want to support:
- various resource types to load from the Kubernetes state (`Pod`s / `Namespace`s / `Service`s / etc)
- various ways to filter the slice of the Kubernetes state to load (`Pod`s in all namespaces / a single `Pod` by the specified namespace and pod names / etc - potentially passing an arbitrary resource endpoint to watch)
- various ways to match the events with the synced (loaded) Kubernetes state (e.g. match the `Pod` record with an event by the pod uid taken from one of the event fields, or by container uid and one of the fields, etc); a prospective idea to achieve this is to build indexes over the synced state, see the `journalbeat` implementation (TODO: add link)
- the ability to annotate the event with arbitrary fields from the matching resource (most likely through passing the paths within the resource)
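To make the "indexes over the synced state" idea more concrete, here is a minimal sketch using only the standard library. All names in it (`PodRecord`, `Index`, `pod_uid`, `match_event`) are hypothetical and are not part of Vector or of any Kubernetes crate; the event is modeled as a flat map of strings purely for illustration.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Simplified stand-in for a synced Kubernetes resource (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct PodRecord {
    pub uid: String,
    pub name: String,
    pub namespace: String,
}

/// An index over the synced state, keyed by a value extracted from each
/// resource. The key extractor is what a user-facing config would select.
pub struct Index<K, R> {
    key_fn: fn(&R) -> K,
    entries: HashMap<K, R>,
}

impl<K: Hash + Eq, R> Index<K, R> {
    pub fn new(key_fn: fn(&R) -> K) -> Self {
        Self { key_fn, entries: HashMap::new() }
    }

    /// Apply an "added or modified" notification from the state sync.
    pub fn upsert(&mut self, resource: R) {
        let key = (self.key_fn)(&resource);
        self.entries.insert(key, resource);
    }

    /// Apply a "deleted" notification from the state sync.
    pub fn remove(&mut self, resource: &R) {
        let key = (self.key_fn)(resource);
        self.entries.remove(&key);
    }

    pub fn get(&self, key: &K) -> Option<&R> {
        self.entries.get(key)
    }
}

/// One possible key extractor: index `Pod`s by their uid.
pub fn pod_uid(pod: &PodRecord) -> String {
    pod.uid.clone()
}

/// Match an event against the index using the uid stored in `field`.
pub fn match_event<'a>(
    index: &'a Index<String, PodRecord>,
    event: &HashMap<String, String>,
    field: &str,
) -> Option<&'a PodRecord> {
    event.get(field).and_then(|uid| index.get(uid))
}
```

Matching by a different property (say, a container id) would then amount to building a second index with a different key extractor, at the cost of keeping one more map in sync.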
What would be great to support eventually:
- the ability to build state-sync hierarchies, where one state loader (reflector) dynamically reconfigures itself based on the state obtained from another state loader (reflector) - this will be unlocked with #4214
Design considerations
Configurability and defaults
The solution has to be very configurable, aimed at advanced users, and designed to cover edge use cases, which means there are very few defaults that we could sanely apply. This is contrary to the kubernetes_logs source, which was designed to work out of the box with minimal configuration and be a solid solution to the one most common use case.
We should still at least try to make the configuration as easy and intuitive as possible.
Use of generics
Due to the nature of the task, we'll likely have to build most of the code around generic primitives like k8s_openapi::Resource and serde::de::DeserializeOwned (not sure if the names are precise but you get the idea), rather than using concrete types like k8s_openapi::[...]::Pod.
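As a rough illustration of what "building around generic primitives" could look like, the sketch below writes one function that works for any resource type described by a trait. The `ResourceMeta` trait, its constants, and `watch_path` are hypothetical stand-ins invented for this example; they do not reproduce the actual `k8s_openapi::Resource` definition, and only core-group (`/api/v1/...`) resources are handled.

```rust
/// Hypothetical stand-in for a trait that describes a resource type.
/// This is NOT the real `k8s_openapi::Resource` trait.
pub trait ResourceMeta {
    const API_VERSION: &'static str;
    const PLURAL: &'static str;
    const NAMESPACED: bool;
}

pub struct Pod;
pub struct Namespace;

impl ResourceMeta for Pod {
    const API_VERSION: &'static str = "v1";
    const PLURAL: &'static str = "pods";
    const NAMESPACED: bool = true;
}

impl ResourceMeta for Namespace {
    const API_VERSION: &'static str = "v1";
    const PLURAL: &'static str = "namespaces";
    const NAMESPACED: bool = false;
}

/// Build the request path for watching any core-group resource type.
/// For namespaced resources, `None` means "all namespaces"; for
/// cluster-scoped resources the namespace argument is ignored.
pub fn watch_path<R: ResourceMeta>(namespace: Option<&str>) -> String {
    match (R::NAMESPACED, namespace) {
        (true, Some(ns)) => format!(
            "/api/{}/namespaces/{}/{}?watch=true",
            R::API_VERSION,
            ns,
            R::PLURAL
        ),
        _ => format!("/api/{}/{}?watch=true", R::API_VERSION, R::PLURAL),
    }
}
```

The point of the sketch is that nothing in `watch_path` mentions `Pod` specifically: supporting another resource type means adding a trait implementation rather than another code path.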
QA
To ensure proper quality, we'll have to cover the implementation with both E2E tests and unit tests.
The E2E tests can consist of just two cases as a start:
- A simple case of reading the files via the `file` source and annotating the events similar to the `kubernetes_logs` source.
- A test simulating the sidecar deployment, where Vector is configured to generate events and annotate them with the Kubernetes state of its own `Pod`.
Ideally, we'd want to have more test scenarios, but we can add more as we go.
Proposed implementation plan
- Implement a generic resource (`AnyResource`) to be able to work with arbitrary Kubernetes resources. It must implement `k8s_openapi::Resource` and `serde::de::DeserializeOwned`.
- Implement a generic configurable watch request builder to be able to build the arbitrary watch requests to the Kubernetes API as configured by the user.
- Implement a state layer that would allow quick lookups by the user-configurable lookup fields (aka configurable indexer).
- Implement a configurable annotator to fill-in arbitrary user-configured fields from an arbitrary resource into the event.
- Implement configurable event dropping: a mechanism to allow users to drop events based on some predicate rules.
- Tie all this together in a transform.
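The annotator step could be sketched roughly as follows. This is only an illustration under several assumptions: the `Value` enum is a hypothetical, heavily reduced stand-in for a deserialized resource, paths are plain dot-separated strings (which would not work for keys that themselves contain dots, such as many label names), and the event is a flat string map rather than a real Vector event.

```rust
use std::collections::BTreeMap;

/// Minimal JSON-like value standing in for a deserialized resource
/// (hypothetical; a real implementation would need more variants).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    String(String),
    Object(BTreeMap<String, Value>),
}

/// Look up a dot-separated path such as `metadata.name` within a value.
/// Returns `None` if any segment is missing or descends into a non-object.
pub fn lookup<'a>(root: &'a Value, path: &str) -> Option<&'a Value> {
    path.split('.').try_fold(root, |current, segment| match current {
        Value::Object(map) => map.get(segment),
        Value::String(_) => None,
    })
}

/// Copy each configured (resource path, event field) pair into the event.
/// Paths that are missing or do not point at a string are skipped.
pub fn annotate(
    event: &mut BTreeMap<String, String>,
    resource: &Value,
    fields: &[(&str, &str)],
) {
    for &(path, target) in fields {
        if let Some(Value::String(s)) = lookup(resource, path) {
            event.insert(target.to_string(), s.clone());
        }
    }
}
```

In a transform, `fields` would come from the user's configuration, and `resource` would be whatever record the lookup in the state layer returned for the event being processed.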
Open questions
- Do we implement our own custom predicate logic for dropping events, or leave it and request that users just use the `reduce` / `remap` transform / etc?
Closing, see https://github.com/timberio/vector/pull/5317#issuecomment-762377958.
Reopening this just to track additional use-cases / reports like https://github.com/vectordotdev/vector/issues/20366