network-observability-operator
network-observability-operator copied to clipboard
Network policy fix for HyperShift
Description
In HyperShift deployments and other external control plane configurations, the flowlogs-pipeline pods were unable to connect to the Kubernetes API server, resulting in timeout errors:
Cannot get dynamic config: Get "https://172.30.0.1:443/api/v1/namespaces/netobserv/configmaps/flowlogs-pipeline-config-dynamic": dial tcp 172.30.0.1:443: i/o timeout Failed to watch err="failed to list *v1.Pod: Get https://172.30.0.1:443/api/v1/pods?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout"
In HyperShift deployments, the control plane (API server, etcd, controllers) runs in a separate management cluster, while worker nodes run in a different cluster. This is evident from:
- HyperShift label: The
kubernetesservice endpoint has the labelhypershift.openshift.io/managed: "true" - No local pods: The
defaultnamespace contains no pods serving the API server endpoint
The previous NetworkPolicy only allowed egress traffic to port 6443 when targeting specific in-cluster namespaces (openshift-apiserver, openshift-kube-apiserver) using namespaceSelector with
podSelector. However, external control plane endpoints are not represented as pods within the worker cluster, so the namespace selector rules don't apply and traffic is blocked.
NB: this issue is specific to main branch; release-1.10 is unaffected
Dependencies
n/a
Checklist
If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.
- [ ] Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
- [ ] Does this PR require product documentation?
- [ ] If so, make sure the JIRA epic is labeled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
- [ ] Does this PR require a product release notes entry?
- [ ] If so, fill in "Release Note Text" in the JIRA.
- [ ] Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
- [ ] If so, make sure it is described in the JIRA ticket.
- QE requirements (check 1 from the list):
- [x] Standard QE validation, with pre-merge tests unless stated otherwise.
- [ ] Regression tests only (e.g. refactoring with no user-facing change).
- [ ] No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from leandroberetta. For more information see the Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 73.05%. Comparing base (6613e89) to head (ae5996d).
:warning: Report is 1 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #2157 +/- ##
==========================================
+ Coverage 73.02% 73.05% +0.02%
==========================================
Files 80 80
Lines 9180 9177 -3
==========================================
Hits 6704 6704
+ Misses 2062 2060 -2
+ Partials 414 413 -1
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 73.05% <ø> (+0.02%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Files with missing lines | Coverage Δ | |
|---|---|---|
| internal/controller/networkpolicy/np_objects.go | 92.25% <ø> (-0.15%) |
:arrow_down: |
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
Hmm, this weakens considerably the egress policy There should be another way, no? We're surely not the first having to deal with that
Hmm, this weakens considerably the egress policy There should be another way, no? We're surely not the first having to deal with that
I will investigate this issue. I think that we may want to watch the endpoint IP and add it to the network policy, but that would be something more complex, I don't think that ip would change, maybe we can do it one time, although if it changes, won't work, so we may want to watch that endpoint.
@leandroberetta yes, I asked the ovn folks, restricting on 172.20.0.1:6443 should work, and like you said we can know this address by watching the endpoint. The only downside I see, is that the Endpoint API is deprecated so sooner or later it will break; but ok, we will fix it when that happens, I guess EDIT: I see it's also defined as an EndpointSlice, so we can use the new API
New changes are detected. LGTM label has been removed.
@leandroberetta yes, I asked the ovn folks, restricting on 172.20.0.1:6443 should work, and like you said we can know this address by watching the endpoint. The only downside I see, is that the Endpoint API is deprecated so sooner or later it will break; but ok, we will fix it when that happens, I guess EDIT: I see it's also defined as an EndpointSlice, so we can use the new API
Hi @jotak, good to know, yes, I implemented a version where first looks for EndpointSlice and fallback to Endpoint in case the new API is not available.
It's working fine for HyperShift, although I'm not sure how to properly try if the ip changes (I'm not sure is a common case or even possible). Same for IPv6 clusters (network policy needs CIDR so the /32 or /128 in the range needs to be provided accordingly).
@leandroberetta: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-operator | c6fa84178a6b09bdab902a523b90032001acf304 | link | false | /test e2e-operator |
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.