Adding/fixing DNSProxy metrics
A few changes are made here:
- `cilium_fqdn_semaphore_rejected_total` wasn't being updated correctly, due to an incorrect error check. This is fixed now (see the first sketch below).
- Based on the discussion here, the field `scope:datapathTime` in `cilium_proxy_upstream_reply_seconds` was split into two different scopes: `policyGenerationTime` (for updating the DNS caches and policy caches) and `datapathTime` (which includes the async policy map updates and the identity cache updates). The second sketch below illustrates the split.
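To illustrate the first change, here is a minimal sketch of how the rejection counter is meant to be driven by the semaphore's error path. This is not Cilium's actual code; the package, function, and variable names are hypothetical, and it assumes a `golang.org/x/sync/semaphore` limiter guarded by a grace-period context.

```go
package dnsproxy

import (
	"context"
	"errors"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"golang.org/x/sync/semaphore"
)

// semaphoreRejectedTotal is a stand-in for cilium_fqdn_semaphore_rejected_total.
var semaphoreRejectedTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "cilium_fqdn_semaphore_rejected_total",
	Help: "DNS requests rejected because the concurrency semaphore could not be acquired in time.",
})

// acquireWithMetric takes one slot from the concurrency limiter and bumps the
// rejection counter only when Acquire itself fails because the grace period
// expired. The point of the fix is that the counter tracks this error path
// rather than an unrelated condition.
func acquireWithMetric(sem *semaphore.Weighted, gracePeriod time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), gracePeriod)
	defer cancel()

	if err := sem.Acquire(ctx, 1); err != nil {
		// Acquire returns ctx.Err() when the context expires before a slot
		// frees up; that is the "rejected" case we want to count.
		if errors.Is(err, context.DeadlineExceeded) {
			semaphoreRejectedTotal.Inc()
		}
		return err
	}
	return nil
}
```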
Fixes the `semaphore_rejected_total` metric and adds a new `scope` to the `proxy_upstream_reply_seconds` metric.
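And for the second change, a minimal sketch of the scope split, again with hypothetical names: the DNS/policy cache update time and the datapath update time are observed under separate `scope` label values instead of being folded into a single `datapathTime`.

```go
package dnsproxy

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// proxyUpstreamReplySeconds is a stand-in for cilium_proxy_upstream_reply_seconds,
// labeled by processing scope.
var proxyUpstreamReplySeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name: "cilium_proxy_upstream_reply_seconds",
	Help: "Seconds spent handling an upstream DNS reply, by processing scope.",
}, []string{"scope"})

// recordDNSProcessingTimes reports the DNS/policy cache updates under
// policyGenerationTime and the async policy map + identity cache updates
// under datapathTime, instead of attributing everything to datapathTime.
func recordDNSProcessingTimes(policyGenStart, datapathStart, done time.Time) {
	proxyUpstreamReplySeconds.WithLabelValues("policyGenerationTime").
		Observe(datapathStart.Sub(policyGenStart).Seconds())
	proxyUpstreamReplySeconds.WithLabelValues("datapathTime").
		Observe(done.Sub(datapathStart).Seconds())
}
```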
Signed-off-by: Rahul Joshi [email protected]
/assign @joestringer
@joestringer friendly ping here.
/test
/test
I'm not sure what's going on with the jenkins CI jobs, I've started a thread on Slack to follow up.
EDIT: Retrying below in case this is some transient upstream error with GitHub.
/test
It looks like the L4LB failure was introduced because the PR hasn't been rebased recently.
We are actively investigating the Jenkins failures, they do not appear to be related to the PR content. Until we've resolved that issue, there is no need to take further action. Once we've resolved it, you can rebase the PR and we should be OK to proceed with running the tests again.
That issue affecting the jenkins tests should be resolved now. Please rebase, and I will kick off the full testsuite afterwards.
OK, the rebase is done.
/test
Job 'Cilium-PR-K8s-1.23-kernel-5.4' failed:
Test Name: K8sDatapathConfig Host firewall With VXLAN
Failure Output: FAIL: Failed to reach 192.168.56.12:80 from testclient-tjqw9
If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.23-kernel-5.4 so I can create one.
Job 'Cilium-PR-K8s-1.25-kernel-net-next' failed:
Test Name: K8sDatapathConfig Check BPF masquerading with ip-masq-agent DirectRouting
Failure Output: FAIL: Failed to add ip route
If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.25-kernel-net-next so I can create one.
It looks like the previous test results have expired, so we can't see them. Can you rebase once more on tip of master, and I'll re-kick CI to see whether this PR introduces any reliable failures?
/test
/test
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@joestringer @aanm can we merge this?
I think we're waiting for a green CI run. I can kick off the tests.
/test
The fuzz test seems to have OOM'd, if I'm reading that correctly. And I think CodeQL is failing due to some Go version mismatch?
Fuzz failure may be https://github.com/cilium/cilium/issues/22459.
I don't know about the CodeQL error. It's not marked as a required job, but could you file an issue for that failure? We don't need to block the PR on it; I just want to make sure we're tracking failures across the codebase so that others can search for the same failure, identify whether it's relevant to their PR, and collaborate on fixes.
net-next job failed vm provisioning, we can re-kick that.
/test-1.26-net-next
Filed https://github.com/cilium/cilium/issues/22488. I couldn't find the zip file, but I hope there's enough info there to start looking into it.
Travis looks like it only hit an issue on arm64; it seems to be an upstream provider issue with the apt repository for the test's dependencies. This PR shouldn't affect that.
@rahulkjoshi was there anything meaningful in the latest change that you think may require CI to run to validate it? I didn't get a chance to see whether the last test I ran was successful, and now there's a fresh version that CI hasn't run against.
I guess all the other jobs were passing or failing with known/acceptable failures, so only the net-next job needs to run again to check. I'll re-run just that one.
/test-1.26-net-next
The net-next run hit #22504 again. I've seen that in other PRs now too, so it's not related to this PR. Looking back over the changes, the risk of breakage on some specific k8s version or so on is very low, and previously we had almost all green test runs. I'm OK to merge this as-is.