[loki.source.kubernetes] decrease logLevel of log msg
What's wrong?
I am receiving the following log message multiple times a second. This seems to be expected behavior from this component; however, it's also expected behavior from my pods (dozens of Ceph OSDs). I feel the appropriate solution here would be to decrease this log to level=debug, or alternatively to allow it to be configured somehow.
ts=2024-01-17T01:28:07.837938857Z level=info msg="have not seen a log line in 3x average time between lines, closing and re-opening tailer" target=ceph/rook-ceph-osd-16-6d87ddf895-kghz8:osd component=loki.source.kubernetes.allPods rolling_average=2s time_since_last=6.739761385s
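A coarse workaround would be raising the global log level via the Flow mode logging block (a sketch below, assuming the Helm chart lets you set it), but that also hides every other info-level line from the agent:

// Global logging block for Grafana Agent Flow mode (also valid in Alloy).
// "warn" silences this info-level tailer message, but also every other
// info-level line the agent emits.
logging {
  level  = "warn"
  format = "logfmt"
}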
Steps to reproduce
NA
System information
No response
Software version
docker.io/grafana/agent:v0.39.0
Configuration
Running the agent via the Helm chart as a DaemonSet in Flow mode
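Roughly, the relevant pipeline looks like the sketch below (a minimal sketch rather than the actual values file; the Loki endpoint is a placeholder, and allPods matches the component name in the log line above):

// Discover every pod in the cluster.
discovery.kubernetes "pods" {
  role = "pod"
}

// Tail logs for all discovered pods through the Kubernetes API.
loki.source.kubernetes "allPods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]
}

// Ship the collected logs to Loki (placeholder URL).
loki.write "default" {
  endpoint {
    url = "http://loki.monitoring.svc:3100/loki/api/v1/push"
  }
}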
Logs
NA
How many times is this being logged for how many targets over 15 minutes?
The screenshot shows the results of this query, i.e. how many log lines per minute, over a 3-hour time window. Note this is an empty cluster running just system components, so the workload is quite light at 176 running pods. I included the tailer stopped; will retry log msg because it seems to be a reaction to the reported event (log msg).
sum by(level) (count_over_time({namespace="monitoring", pod=~"grafana-agent.+"} | logfmt | msg =~ `(have not seen a log line in 3x average time between lines, closing and re-opening tailer|tailer stopped; will retry)` [1m]))
It looks like 55 targets are graphed in this 3-hour window, so almost a third of my pods. That seems excessive.
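To count the distinct targets directly instead of reading them off the graph, a variant of the same query should work (a sketch, reusing the label names from the query above over the 3-hour window):

count(
  sum by (target) (
    count_over_time({namespace="monitoring", pod=~"grafana-agent.+"}
      | logfmt
      | msg =~ `(have not seen a log line in 3x average time between lines, closing and re-opening tailer|tailer stopped; will retry)` [3h]
    )
  )
)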
On second thought, is this feature really that useful? Can it be disabled or configured?
If your K8s version is < v1.29.1, this re-open behavior is required. Above that, it's not; see https://github.com/kubernetes/kubernetes/issues/115702 and https://github.com/grafana/agent/pull/5623
I made a PR to address the issue. PTAL if you have time :D
Thank you for the explanation; that clarifies a lot. I will test it out today/tomorrow. The only thing I spotted in the PR is that the log msg is still level=info. Is that intentional? Does it make sense to change it to debug?
@hainenber I upgraded my cluster from 1.28.1 to 1.29.1 today and it looks to still be closing the connection after 3x the average duration. It seems to be doing it significantly less often, but I think that might be the pods; I don't think I had built the image and rolled it into my values file by the timestamp I'm seeing, so I will follow up on this part. On that topic, I couldn't find a pre-built image off your branch in the CI/CD, so I built it myself. Below are the steps I took; perhaps I did it wrong?
git clone git@github.com:hainenber/agent.git
cd agent
git checkout not-restart-tailers-for-k8s-v1.29.1+
DOCKER_BUILDKIT=1 docker build --file cmd/grafana-agent/Dockerfile -t <repo:tag> .
Second thing I noticed is that the kubernetes/kubernetes/pull/115702 PR looks to have been released in 1.29.0, not 1.29.1 (see the changelog -- search for 115702).
I agree this should be dropped down to debug
@TheRealNoob @mattdurham thank you for the feedback! I've made the corrections accordingly.
Btw, re: building an Agent image, I'd suggest using make agent-image :D (at least that's what I've been using)
Thank you @hainenber. I rebuilt my image using your latest commit and it seems to work as expected. However, looking at the code I think I see why it didn't work for me before (again, I'm running 1.29.1) and that it's still not quite right. This line checks whether the k8s version is less than or equal to 1.29; it should just be less than, since 1.29.0 is when the bug was fixed.
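For illustration, the check I have in mind would look roughly like the sketch below (not the code from the PR; shouldRollTailer is a hypothetical helper, and the comparison uses k8s.io/apimachinery/pkg/util/version):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/version"
)

// kubeletFixedVersion is the first release containing the fix from
// kubernetes/kubernetes#115702, so tailers no longer need to be
// periodically re-opened from this version onward.
var kubeletFixedVersion = version.MustParseGeneric("1.29.0")

// shouldRollTailer is a hypothetical helper: it reports whether the
// "close and re-open after 3x average time" workaround is still needed
// for the given Kubernetes server version (e.g. "v1.29.1").
func shouldRollTailer(gitVersion string) (bool, error) {
	v, err := version.ParseGeneric(gitVersion)
	if err != nil {
		return false, err
	}
	// Strictly less than 1.29.0: 1.29.0 and later already contain the fix.
	return v.LessThan(kubeletFixedVersion), nil
}

func main() {
	for _, gv := range []string{"v1.28.1", "v1.29.0", "v1.29.1"} {
		need, err := shouldRollTailer(gv)
		if err != nil {
			fmt.Println(gv, "error:", err)
			continue
		}
		fmt.Printf("%s -> re-open workaround needed: %v\n", gv, need)
	}
}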
Second small thing: the changelog and a few comments need to be updated to reflect the above.
Thank you
Thanks @TheRealNoob for the testing and findings! I've addressed all the items you found :D
Once again, thanks 🙏
This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!
Hi there :wave:
On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.
To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)