
Vector stops reading logs from EKS with AL2023 based AMIs

Open mscanlon72 opened this issue 1 month ago • 30 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Vector stops reading logs from EKS worker nodes that use AL2023-based AMIs. New worker nodes backed by an AL2023 AMI were rolled out to the EKS clusters, and some hours later Vector pods stop collecting logs. Using vector tap, we cannot see any events in the output of the kubernetes_logs source.

Configuration

customConfig:
    data_dir: /var/lib/vector
    api:
      enabled: true
      address: 127.0.0.1:8686
      playground: true
    # sources -- Sources for Vector to pull data
    ##
    ##
    sources:
      # Internal logs for Vector
      internal_logs:
        type: internal_logs
      # Internal metrics for Vector
      internal_metrics:
        type: internal_metrics
        scrape_interval_secs: 10
      # Kubernetes logs
      k8s_logs:
        type: kubernetes_logs
        glob_minimum_cooldown_ms: 500
        max_read_bytes: 2097152

Version

0.48.0, 0.49.0, 0.50.0 (distroless)

Debug Output


Example Data

No response

Additional Context

Rolling back to AL2-backed AMIs resolves the issue. AL2 reaches end of life in November 2025.

References

No response

mscanlon72 avatar Oct 30 '25 20:10 mscanlon72

I am assuming Vector is supported on AL2023 EKS AMIs?

mscanlon72 avatar Oct 30 '25 23:10 mscanlon72

Is there a debug version of vector available?

mscanlon72 avatar Oct 31 '25 18:10 mscanlon72

To add some more context: the kubernetes_logs source stops working. The theory is a stuck thread. All other components are still working and processing data. We have trace logs sent into Splunk, so we can provide anything requested. We do not see any failure or error messages from the source; it no longer logs anything at all, which supports the stuck-thread theory. The destination we use for logs is splunk_hec, and it continues to receive the internal logs we send. This is the same endpoint we send the Kubernetes logs from the source to, which indicates the destination isn't the issue, since it is still receiving the internal logs.

We are trying the alpine variant. Any help would be appreciated.

@pront @jszwedko

mscanlon72 avatar Nov 03 '25 16:11 mscanlon72

Another data point: when a pod exhibits the issue and stops reading logs, the number of timeouts we see on the splunk_hec destination increases dramatically on that pod only. Other pods do not show this increase.

mscanlon72 avatar Nov 03 '25 17:11 mscanlon72

Hi @mscanlon72, thanks for this report

Is there a debug version of vector available?

You can run vector with the environment variable VECTOR_LOG=debug or even VECTOR_LOG=trace (which will be very spammy but will give you important trace logs such as this one and others in this file). This will show you debug information from all sources including the file source (the kubernetes_logs source is built on top of the file source).
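If you're deploying with the Helm chart, that could look something like the snippet below (a sketch only; I'm assuming here that the chart exposes an env list for the Vector container, so adjust to however you inject environment variables):

env:
  - name: VECTOR_LOG
    value: debug   # or "trace" for maximum verbosity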

A major difference between the kubernetes_logs source and the file source is that the former uses a custom paths provider, the K8sPathsProvider. If the source is not properly reading files, it is very likely the issue is with that paths provider (provided the files weren't deleted/truncated/moved etc. - if that's the case you may be hitting edge cases in the file source itself and then you'll need to debug the checkpoints.json file nested under the data_dir, but this is likely not the case). K8sPathsProvider gets its events from kube::Api::all, which gets passed into kube::runtime::watcher (source), meaning that all paths are essentially resolved by the kube library and the file source just naively loops over them.

One (maybe) big issue is that we're using kube v0.93.0 but they have since released v2.0.1, so it may be a matter of upgrading to the latest version and this issue could be fixed. That'd be the first step I'd take to try and fix this (along with turning on the debug/trace logs). You can try to update the dependency and create a custom build. Since kube has gone through two major version bumps since v0.93.0 (v1.0.0 in May 2025, v2.0.0 in Sept 2025), I'd recommend a progressive upgrade rather than jumping directly to v2.0.1. Their changelog provides more information as to what was updated, most notably v1.0.0 introduced support for Kubernetes v1.33 and v2.0.0 introduced support for Kubernetes v1.34.

thomasqueirozb avatar Nov 07 '25 16:11 thomasqueirozb

@thomasqueirozb, thank you for your response, we really appreciate it as we are kinda stuck with AL2 EKS AMIs expiring this month.

We have been running with TRACE logs enabled for about a week now, so we have lots of data to provide.

We are using the kubernetes_logs source, not the file source. We've been using it for a while now, and this is really the first major issue we have run into. Also, we do see multiple clusters that are working without error. Some clusters have the issue, but it only presents on some nodes, not all.

We are curious about the kube-rs library. However, the version of Kubernetes we are using is the same regardless of the AL2 or AL2023 EKS AMI release. Maybe something in the logs can provide some clarity on this potential issue.

We do have a team member that is looking into building a custom Vector release with an updated kube-rs library. We are planning to base it off of 0.49.0 as it is the version we are running in our infra.

We are running Kubernetes server version v1.31.13-eks-113cf36, kubelet version v1.31.7-eks-473151a, and container runtime containerd://2.0.5.

mscanlon72 avatar Nov 07 '25 18:11 mscanlon72

We are using the kubernetes_logs source not the file source.

I saw that from your config. I mentioned the file source because the kubernetes_logs source uses the FileServer from the file source, so you might see some logs that are tagged as coming from there, and you should keep an eye out for those 🙂

Let us know how the deployment with the updated kube library goes; hopefully that will give us some insight.

thomasqueirozb avatar Nov 07 '25 19:11 thomasqueirozb

We will. In the meantime, is there anything we can look at in the logs? Something we can target for information? We have tons of trace logs ready to query in Splunk.

mscanlon72 avatar Nov 07 '25 19:11 mscanlon72

Watching for the logs present in this file: https://github.com/vectordotdev/vector/blob/a488105710032b593051496bad9dc8df5e8cce6c/src/internal_events/file.rs

and these ones too https://github.com/vectordotdev/vector/blob/02671f454061bdb41f9600cafcff3b4f26bd3773/lib/file-source/src/file_server.rs#L199-L227

should give you at least some insight.

Unfortunately, it's harder to know why something isn't happening than why it is happening... Watching the utilization metric is also a good idea, but since you're not seeing anything out of vector tap it wouldn't be so useful.

thomasqueirozb avatar Nov 07 '25 20:11 thomasqueirozb

Those watched files messages disappear once the issue shows up.

mscanlon72 avatar Nov 10 '25 18:11 mscanlon72

Well then this seems like an issue in the K8sPathsProvider. I'd suggest adding some logs in that part of the code and inside this function too. See my previously linked block of code here. I now suspect it is very likely that something is wrong with the kube library (or at least with the connection between Vector and the Kubernetes API, which the library handles), because we basically use events from the API and forward them to the file server, which then reads the files. If nothing is returned from the API, the paths provider won't generate any paths and no files are watched.

thomasqueirozb avatar Nov 10 '25 19:11 thomasqueirozb

Ok, I'm working on a custom build with newer libs. Even though the Kubernetes version doesn't change when we use AL2 vs AL2023, you still think it's the paths provider? We certainly thought it was a library issue, but were not able to confirm.

Point being, we've been running 1.31 for a while, it was only the switch to AL2023 that triggered this.

mscanlon72 avatar Nov 10 '25 23:11 mscanlon72

Looking to build and create a new container for Vector with the updated libraries. I have the Cargo updates in place and have done a dev build, but do you have any additional docs about how to build a multi-arch container from a custom build? It looks like I need to run build-x86_64-unknown-linux-musl and build-aarch64-unknown-linux-musl for an Alpine build. Specifically, I'm asking about the container creation part: ensuring all the Vector configurations, envs, etc. are correct.

mscanlon72 avatar Nov 11 '25 18:11 mscanlon72

Ok, I'm working on a custom build with newer libs. Even though the Kubernetes version doesn't change when we use AL2 vs AL2023, you still think it's the paths provider? We certainly thought it was a library issue, but were not able to confirm.

The paths provider gets its paths from the library. It doesn't seem like a bug in the paths provider but in the kube library. That, however, is the most convenient interface inside Vector to debug, which is why I made those comments regarding it.

For the custom builds, you should be able to run make package-x86_64-unknown-linux-musl and make package-aarch64-unknown-linux-musl to build and package. Outputs will be placed inside target/artifacts.

thomasqueirozb avatar Nov 11 '25 18:11 thomasqueirozb

I ran make package-x86_64-unknown-linux-musl and I get the error below. Are there any further updates I need to run? I've seen this a few times now. I am on a Mac M1.

 => ERROR [internal] load metadata for ghcr.io/cross-rs/x86_64-unknown-linux-musl:0.2.5                                      0.8s
------
 > [internal] load metadata for ghcr.io/cross-rs/x86_64-unknown-linux-musl:0.2.5:
------
ghcr.io/cross-rs/x86_64-unknown-linux-musl:0.2.5: failed to resolve source metadata for ghcr.io/cross-rs/x86_64-unknown-linux-musl:0.2.5: no match for platform in manifest: not found
make[1]: *** [Makefile:264: cross-image-x86_64-unknown-linux-musl] Error 1
make[1]: Leaving directory '/git/vectordotdev/vector'
make: *** [Makefile:291: target/x86_64-unknown-linux-musl/release/vector] Error 2

mscanlon72 avatar Nov 11 '25 18:11 mscanlon72

While I am learning to do this, would it be possible to kick off a custom build for us w/the updated libraries?

mscanlon72 avatar Nov 11 '25 19:11 mscanlon72

I ran make package-x86_64-unknown-linux-musl and I get the error below. Are there any further updates I need to run? I've seen this a few times now. I am on a Mac M1.

This won't work on a Mac M1 (or any Arm Mac, for that matter) but will on an x86 Linux machine. I ran into the same issue before; I think this is the tracking issue on their side: https://github.com/cross-rs/cross/issues/975

While I am learning to do this, would it be possible to kick off a custom build for us w/the updated libraries?

Do you have a branch with the updated libraries?

thomasqueirozb avatar Nov 12 '25 14:11 thomasqueirozb

I will get one ready, thank you.

mscanlon72 avatar Nov 12 '25 18:11 mscanlon72

@thomasqueirozb I am having a hard time getting a branch ready, but I wanted to provide more data points.

Vector pods that experience this issue also show a value of 1 for the utilization metric on two throttle components. These components throttle the logs read in from the kubernetes_logs source. This is odd, since no logs are flowing to those components while Vector is not reading any in.
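For reference, the throttle components are wired up roughly like this (an illustrative sketch only; the component name and thresholds here are placeholders, not our exact config):

transforms:
  k8s_logs_throttle:
    type: throttle
    inputs:
      - k8s_logs
    threshold: 1000      # max events per window before throttling kicks in
    window_secs: 1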

In addition, the offending pods' CPU and memory usage plateaus: CPU drops to less than 1 millicore and holds there, and memory stays flat with little change.

mscanlon72 avatar Nov 13 '25 23:11 mscanlon72

@thomasqueirozb and @pront I have submitted a PR to bump the features for k8s-openapi. Hopefully my understanding is correct and the feature flag can be bumped. If a custom build could be started for me, that would be extremely helpful. I plan on submitting a second PR with updated versions of the libraries to reach support for v1_32.

https://github.com/vectordotdev/vector/pull/24263

mscanlon72 avatar Nov 18 '25 01:11 mscanlon72

@thomasqueirozb and @pront I have submitted a PR to bump the features for k8s-openapi. Hopefully my understanding is correct and the feature flag can be bumped. If a custom build could be started for me, that would be extremely helpful. I plan on submitting a second PR with updated versions of the libraries to reach support for v1_32.

#24263

Started a build run for you: https://github.com/vectordotdev/vector/actions/runs/19470870947

Normally we would link to https://vector.dev/docs/setup/installation/manual/from-source/ and let users create their own custom builds. We can definitely improve this, maybe by providing a script that takes a branch (or fork and branch) and creates a Docker image.

pront avatar Nov 18 '25 15:11 pront

Thanks @thomasqueirozb and @pront, I appreciate the efforts here. We're all on Apple silicon Macs over here, so it has been a challenge. We're working on setting up a build server for everyone to use, but in the meantime I appreciate the help.

I am running the custom build in a test environment for a bit, then I'll move to our problematic environment and hopefully we'll have some results to share.

mscanlon72 avatar Nov 18 '25 18:11 mscanlon72

Hey @thomasqueirozb and @pront, after about 8 hours or so we saw three nodes stop processing logs. The update with the feature flag set to v1_30 did not work.

I will have to work on another PR to bump up libraries further to see if that can help.

mscanlon72 avatar Nov 19 '25 07:11 mscanlon72

Small update: check https://github.com/vectordotdev/vector/issues/12014#issuecomment-3560101781. It's possible this might help you if the issue you're facing is that namespace metadata is not available.

thomasqueirozb avatar Nov 20 '25 21:11 thomasqueirozb

@thomasqueirozb Thank you for your comment. We are not seeing the annotation issue in the cluster where the issue is reproducible.

Another interesting data point: we saw a Vector pod recover from the issue. Hours later, after reading in no logs, the pod suddenly started reading logs again. This is something that we did not see (or notice) before. It isn't that pods on this node were not logging anything; I did check that. Lots of daemonset pods, chatty apps, etc. were still emitting. We find this very interesting. We've noticed this on two pods so far. Any additional thoughts here?

mscanlon72 avatar Nov 21 '25 19:11 mscanlon72

@mscanlon72 what I think is the most likely culprit of this issue, and the one I mentioned in my previous comment, is that namespace metadata lookup is either failing or very slow (this is likely a bug somewhere, either in Vector or in the kube crate). In your situation I'd strongly recommend setting insert_namespace_fields: false and checking whether this mitigates the issue, even if you're not seeing any logs.
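In your posted values that would be a one-line addition to the source (sketch based on your config above):

sources:
  k8s_logs:
    type: kubernetes_logs
    glob_minimum_cooldown_ms: 500
    max_read_bytes: 2097152
    insert_namespace_fields: false   # skip the namespace metadata lookup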

If you want to verify that this is the issue you can check that you are seeing https://github.com/vectordotdev/vector/blob/bc654a796aafd331ebae99c77b31ab83e38eb62d/src/sources/kubernetes_logs/k8s_paths_provider.rs#L59

but not seeing https://github.com/vectordotdev/vector/blob/bc654a796aafd331ebae99c77b31ab83e38eb62d/src/sources/kubernetes_logs/k8s_paths_provider.rs#L69

with your current configuration. Also keep in mind that if you are running Vector with trace logs without any filtering, you might unfortunately run into #24220. See https://vector.dev/guides/developer/debugging/#controlling-log-verbosity for information about filtering. If you see that this is the case, then you should set insert_namespace_fields: false, which should resolve that issue.
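For the filtering, something along these lines should keep the volume down while still tracing the relevant code paths (I believe VECTOR_LOG accepts standard tracing filter directives as described in that guide, but double-check the module paths below, they're from memory):

env:
  - name: VECTOR_LOG
    value: "info,vector::sources::kubernetes_logs=trace,file_source=trace"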


Side note: the current behavior is not ideal. I think we need to come up with something, or add another flag, that would allow logs to flow through the source without namespace metadata. The biggest issue here is that we want to provide users with the best experience but also don't want to break existing users that rely on that data, so it's unclear how we should handle this.

thomasqueirozb avatar Nov 21 '25 20:11 thomasqueirozb

@thomasqueirozb When the issue presents itself, we don't see either of those messages. When things are working properly, we see them both.

I will try insert_namespace_fields: false regardless.

Would use_apiserver_cache set to true be another option?
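i.e. something like this on the same source (just a sketch, assuming the option is honored by kubernetes_logs as documented):

sources:
  k8s_logs:
    type: kubernetes_logs
    use_apiserver_cache: true   # allow metadata requests to be served from the kube-apiserver cache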

mscanlon72 avatar Nov 22 '25 01:11 mscanlon72

If the log verbosity issue only drops internal_logs, I'm less concerned with that. I have seen some beeps, but not that many, though I guess that is relative. It's a 76-node cluster; here is a little snippet.

2025-11-22T01:51:15.633084Z TRACE vector: Internal log [Beep.] has been suppressed 9 times.
2025-11-22T01:51:15.633097Z TRACE vector: Beep.
2025-11-22T01:51:12.671062Z TRACE vector: Internal log [Beep.] has been suppressed 9 times.
2025-11-22T01:51:12.671074Z TRACE vector: Beep.
2025-11-22T01:51:15.368998Z TRACE vector: Internal log [Beep.] is being suppressed to avoid flooding.
2025-11-22T01:51:13.316973Z TRACE vector: Internal log [Beep.] has been suppressed 9 times.
2025-11-22T01:51:13.316989Z TRACE vector: Beep.
2025-11-22T01:51:12.585449Z TRACE vector: Internal log [Beep.] is being suppressed to avoid flooding.
2025-11-22T01:51:19.712364Z TRACE vector: Internal log [Beep.] is being suppressed to avoid flooding.
2025-11-22T01:51:18.985752Z TRACE vector: Internal log [Beep.] is being suppressed to avoid flooding.

mscanlon72 avatar Nov 22 '25 01:11 mscanlon72

@pront @thomasqueirozb Can we use this issue to request that Vector officially update the k8s libraries? Vector is a bit behind. However, we do not see the issue on GKE or RKE running k8s versions up to 1.33.x.

mscanlon72 avatar Dec 03 '25 23:12 mscanlon72

@thomasqueirozb Setting insert_namespace_fields: false had no impact. Vector still fails to pull logs on some nodes. We're still seeing some vector pods that recover after failing for some hours.

mscanlon72 avatar Dec 05 '25 17:12 mscanlon72