
Adding/fixing DNSProxy metrics

rahulkjoshi opened this pull request • 7 comments

A few changes are made here:

  • `cilium_fqdn_semaphore_rejected_total` wasn't being updated correctly due to an incorrect error check. This is now fixed.
  • Based on the discussion here, the field `scope:datapathTime` in `cilium_proxy_upstream_reply_seconds` was split into two different scopes: `policyGenerationTime` (for updating the DNS caches and policy caches) and `datapathTime` (which includes the async policy map updates and the identity cache updates). A rough sketch of both patterns follows below.

Fixes the `semaphore_rejected_total` metric and adds a new `scope` to the `proxy_upstream_reply_seconds` metric.
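The snippet below is a minimal, illustrative sketch of the two patterns described above, not Cilium's actual implementation: incrementing a rejection counter only when the semaphore acquisition genuinely fails, and observing the two phases under separate `scope` label values. All identifiers (`semaphoreRejected`, `upstreamReplySeconds`, `handleReply`, the placeholder helpers) are hypothetical.

```go
// Illustrative sketch only; not Cilium's actual code. All identifiers here
// are hypothetical stand-ins for the metrics named in the PR description.
package dnsproxymetrics

import (
	"context"
	"errors"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"golang.org/x/sync/semaphore"
)

var (
	// Stand-in for cilium_fqdn_semaphore_rejected_total.
	semaphoreRejected = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "fqdn_semaphore_rejected_total",
		Help: "DNS requests rejected because the concurrency semaphore was exhausted.",
	})

	// Stand-in for cilium_proxy_upstream_reply_seconds, labeled by scope so
	// policyGenerationTime and datapathTime can be observed separately.
	upstreamReplySeconds = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "proxy_upstream_reply_seconds",
		Help:    "Seconds spent per processing scope while handling a DNS reply.",
		Buckets: prometheus.DefBuckets,
	}, []string{"scope"})
)

func init() {
	prometheus.MustRegister(semaphoreRejected, upstreamReplySeconds)
}

// handleReply shows the shape of the error check the first bullet refers to:
// the rejection counter is bumped only when Acquire actually fails, not on
// every return path.
func handleReply(ctx context.Context, sem *semaphore.Weighted) error {
	if err := sem.Acquire(ctx, 1); err != nil {
		// Acquire fails when ctx is cancelled or its deadline expires
		// while waiting for a slot.
		if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled) {
			semaphoreRejected.Inc()
		}
		return err
	}
	defer sem.Release(1)

	// Time the two phases under separate scope label values.
	start := time.Now()
	generatePolicy() // placeholder: update DNS cache and policy cache
	upstreamReplySeconds.WithLabelValues("policyGenerationTime").
		Observe(time.Since(start).Seconds())

	start = time.Now()
	updateDatapath() // placeholder: async policy map and identity cache updates
	upstreamReplySeconds.WithLabelValues("datapathTime").
		Observe(time.Since(start).Seconds())
	return nil
}

func generatePolicy() {}
func updateDatapath() {}
```

Whether the split is implemented as two label values on one histogram (as sketched) or in some other way is a detail of the actual PR; the sketch is only meant to make the description concrete.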

Signed-off-by: Rahul Joshi [email protected]

rahulkjoshi avatar Sep 09 '22 15:09 rahulkjoshi

/assign @joestringer

rahulkjoshi avatar Sep 09 '22 15:09 rahulkjoshi

@joestringer friendly ping here.

rahulkjoshi avatar Sep 14 '22 18:09 rahulkjoshi

/test

joestringer avatar Sep 20 '22 21:09 joestringer

/test

joestringer avatar Sep 20 '22 21:09 joestringer

I'm not sure what's going on with the Jenkins CI jobs; I've started a thread on Slack to follow up.

EDIT: Retrying below in case this is some transient upstream error with GitHub.

joestringer avatar Sep 20 '22 23:09 joestringer

/test

joestringer avatar Sep 21 '22 13:09 joestringer

It looks like the L4LB failure was introduced because the PR hasn't been rebased recently.

We are actively investigating the Jenkins failures; they do not appear to be related to the PR content. Until we've resolved that issue, there is no need to take further action. Once it's resolved, you can rebase the PR and we should be OK to proceed with running the tests again.
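For reference, a typical rebase onto the upstream default branch looks roughly like this; the remote and branch names are assumptions and depend on how the fork is set up:

```sh
# Fetch the latest upstream history and replay this branch on top of it.
# "upstream" and "master" are assumed names; adjust to your remotes/branches.
git fetch upstream
git rebase upstream/master

# Update the PR branch on your fork; --force-with-lease refuses to clobber
# commits you haven't seen, unlike a plain --force.
git push --force-with-lease origin HEAD
```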

joestringer avatar Sep 21 '22 17:09 joestringer

The issue affecting the Jenkins tests should be resolved now. Please rebase, and I will kick off the full test suite afterwards.

joestringer avatar Sep 22 '22 12:09 joestringer

Ok rebase is done.

rahulkjoshi avatar Sep 22 '22 16:09 rahulkjoshi

/test

Job 'Cilium-PR-K8s-1.23-kernel-5.4' failed:

Test Name: K8sDatapathConfig Host firewall With VXLAN

Failure Output: FAIL: Failed to reach 192.168.56.12:80 from testclient-tjqw9

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.23-kernel-5.4 so I can create one.

Job 'Cilium-PR-K8s-1.25-kernel-net-next' failed:

Test Name: K8sDatapathConfig Check BPF masquerading with ip-masq-agent DirectRouting

Failure Output: FAIL: Failed to add ip route

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.25-kernel-net-next so I can create one.

joestringer avatar Sep 22 '22 16:09 joestringer

It looks like the previous test results have expired, so we can no longer see them. Can you rebase once more on the tip of master? I'll re-kick CI to check whether this PR introduces any reliable failures.

joestringer avatar Oct 19 '22 18:10 joestringer

/test

aanm avatar Oct 21 '22 15:10 aanm

/test

aanm avatar Oct 21 '22 15:10 aanm

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 21 '22 02:11 github-actions[bot]

@joestringer @aanm can we merge this?

rahulkjoshi avatar Nov 30 '22 20:11 rahulkjoshi

I think we're waiting for a green CI run. I can kick off the tests.

joestringer avatar Nov 30 '22 20:11 joestringer

/test

joestringer avatar Nov 30 '22 20:11 joestringer

The fuzz test seems to have OOM'd, if I'm reading that correctly. And I think CodeQL is failing due to some Go version mismatch?

rahulkjoshi avatar Nov 30 '22 22:11 rahulkjoshi

Fuzz failure may be https://github.com/cilium/cilium/issues/22459 .

I don't know about the CodeQL error. It's not marked as a required job, but could you file an issue for that failure? We don't need to block the PR on it; I just want to make sure we're tracking failures across the codebase so that others can search for the same failure, identify whether it's relevant to their PR, and collaborate on fixes.

The net-next job failed VM provisioning; we can re-kick that.

joestringer avatar Dec 01 '22 01:12 joestringer

/test-1.26-net-next

joestringer avatar Dec 01 '22 01:12 joestringer

Filed https://github.com/cilium/cilium/issues/22488 -- I couldn't find the zip file, but I hope there's enough info there to start looking into it.

rahulkjoshi avatar Dec 01 '22 15:12 rahulkjoshi

Travis looks like it only hit an issue on arm64; it seems to be an upstream provider issue with the apt repository for the test's dependencies. This PR shouldn't impact that.

@rahulkjoshi was there anything meaningful in the latest change that you think needs CI to run to validate it? I didn't get a chance to see whether the last test run I kicked off was successful, and now that there's a fresh version of the PR, the previous CI results no longer apply.

joestringer avatar Dec 01 '22 17:12 joestringer

I guess all the other jobs were passing, or failing only with known/acceptable failures, so only the net-next job needs to run again to check. I'll re-run just that one.

joestringer avatar Dec 01 '22 17:12 joestringer

/test-1.26-net-next

joestringer avatar Dec 01 '22 17:12 joestringer

The net-next run hit #22504 again. I've seen that in other PRs now too, so it's not related to this PR. Looking back over the changes, the risk of breakage on some specific k8s version (or similar) is very low, and previously we had almost all-green test runs. I'm OK to merge this as-is.

joestringer avatar Dec 01 '22 21:12 joestringer