[loki.source.kubernetes] decrease logLevel of log msg
What's wrong?
I am receiving the following log message multiple times a second. This seems to be expected behavior from this component; however, it's also expected behavior from my pods (dozens of Ceph OSDs). I feel the appropriate solution here would be to decrease this log to level=debug, or alternatively to allow it to be configured somehow.
ts=2024-01-17T01:28:07.837938857Z level=info msg="have not seen a log line in 3x average time between lines, closing and re-opening tailer" target=ceph/rook-ceph-osd-16-6d87ddf895-kghz8:osd component=loki.source.kubernetes.allPods rolling_average=2s time_since_last=6.739761385s
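A coarse workaround would be raising the global log level via the Flow mode logging block (a sketch below, assuming the Helm chart lets you set it), but that also hides every other info-level line from the agent:

// Global logging block for Grafana Agent Flow mode (also valid in Alloy).
// "warn" silences this info-level tailer message, but also every other
// info-level line the agent emits.
logging {
  level  = "warn"
  format = "logfmt"
}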
Steps to reproduce
NA
System information
No response
Software version
docker.io/grafana/agent:v0.39.0
Configuration
Running the agent via the Helm chart as a DaemonSet in Flow mode
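Roughly, the relevant pipeline looks like the sketch below (a minimal sketch rather than the actual values file; the Loki endpoint is a placeholder, and allPods matches the component name in the log line above):

// Discover every pod in the cluster.
discovery.kubernetes "pods" {
  role = "pod"
}

// Tail logs for all discovered pods through the Kubernetes API.
loki.source.kubernetes "allPods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]
}

// Ship the collected logs to Loki (placeholder URL).
loki.write "default" {
  endpoint {
    url = "http://loki.monitoring.svc:3100/loki/api/v1/push"
  }
}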
Logs
NA
How many times is this being logged for how many targets over 15 minutes?
The screenshot shows the results of this query, i.e. how many log lines per minute, over a 3-hour time window. Note this is an empty cluster running just system components, so the workload is quite light at 176 running pods. I included the tailer stopped; will retry log msg because it seems to be a reaction to the reported event (log msg).
sum by(level) (count_over_time({namespace="monitoring", pod=~"grafana-agent.+"} | logfmt | msg =~ `(have not seen a log line in 3x average time between lines, closing and re-opening tailer|tailer stopped; will retry)` [1m]))
It looks like 55 targets are graphed in this 3-hour window, so almost a third of my pods. That seems excessive.
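To count the distinct targets directly instead of reading them off the graph, a variant of the same query should work (a sketch, reusing the label names from the query above over the 3-hour window):

count(
  sum by (target) (
    count_over_time({namespace="monitoring", pod=~"grafana-agent.+"}
      | logfmt
      | msg =~ `(have not seen a log line in 3x average time between lines, closing and re-opening tailer|tailer stopped; will retry)` [3h]
    )
  )
)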
On second thought, is this feature really that useful? Can it be disabled or configured?
If your K8s version is < v1.29.1, this re-open behavior is required. Above that, it's not; see https://github.com/kubernetes/kubernetes/issues/115702 and https://github.com/grafana/agent/pull/5623
I made a PR to address the issue. PTAL if you have time :D
Thank you for the explanation; that clarifies a lot. I will test it out today/tomorrow. The only thing I spotted in the PR is that the log msg is still level=info. Is that intentional? Does it make sense to change it to debug?
@hainenber I upgraded my cluster from 1.28.1 to 1.29.1 today and it looks to still be closing the connection after 3x the average duration. It seems to be doing it significantly less often, but I think that might be the pods; I don't think I had built the image and rolled it into my values file by the timestamp I'm seeing, so I will follow up on this part. On that topic, I couldn't find a pre-built image off your branch in the CI/CD, so I built it myself. Below are the steps I took; perhaps I did it wrong?
git clone git@github.com:hainenber/agent.git
cd agent
git checkout not-restart-tailers-for-k8s-v1.29.1+
DOCKER_BUILDKIT=1 docker build --file cmd/grafana-agent/Dockerfile -t <repo:tag> .
Second thing I noticed is that the kubernetes/kubernetes/pull/115702 PR looks to have been released in 1.29.0, not 1.29.1 (see the changelog -- search for 115702).
I agree this should be dropped down to debug
@TheRealNoob @mattdurham thank you for the feedback! I've made the corrections accordingly.
Btw, re: building an Agent image, I'd suggest using make agent-image :D (at least that's what I've been using)
Thank you @hainenber. I rebuilt my image using your latest commit and it seems to work as expected. However, looking at the code I think I see why it didn't work for me before (again, I'm running 1.29.1) and that it's still not quite right. This line checks whether the k8s version is less than or equal to 1.29; it should just be less than, since 1.29.0 is when the bug was fixed.
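For illustration, the check I have in mind would look roughly like the sketch below (not the code from the PR; shouldRollTailer is a hypothetical helper, and the comparison uses k8s.io/apimachinery/pkg/util/version):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/version"
)

// kubeletFixedVersion is the first release containing the fix from
// kubernetes/kubernetes#115702, so tailers no longer need to be
// periodically re-opened from this version onward.
var kubeletFixedVersion = version.MustParseGeneric("1.29.0")

// shouldRollTailer is a hypothetical helper: it reports whether the
// "close and re-open after 3x average time" workaround is still needed
// for the given Kubernetes server version (e.g. "v1.29.1").
func shouldRollTailer(gitVersion string) (bool, error) {
	v, err := version.ParseGeneric(gitVersion)
	if err != nil {
		return false, err
	}
	// Strictly less than 1.29.0: 1.29.0 and later already contain the fix.
	return v.LessThan(kubeletFixedVersion), nil
}

func main() {
	for _, gv := range []string{"v1.28.1", "v1.29.0", "v1.29.1"} {
		need, err := shouldRollTailer(gv)
		if err != nil {
			fmt.Println(gv, "error:", err)
			continue
		}
		fmt.Printf("%s -> re-open workaround needed: %v\n", gv, need)
	}
}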
Second small thing: the changelog and a few comments need to be updated to reflect the above.
Thank you
Thanks @TheRealNoob for the testing and findings! I've addressed all the items you found :D
Once again, thanks 🙏
This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!
Hi there :wave:
On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.
To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)