Limit Istio Sidecar Scope to reduce memory and make cluster more scalable
✏️ Summary of Changes
Adds a new Istio Sidecar resource that limits each sidecar's egress visibility to only the services it needs.
We (Roblox) have been running Kubeflow in production for a long time, and we are seeing Istio sidecar memory usage approach 1 GB per sidecar because every sidecar has to cache all of the services in the cluster. This adds up to over 2 TB of memory in total. This change limits which cluster services each sidecar caches, which helps the cluster scale.
This change can save TBs of memory and spare our DNS services. However, I want to ask the community whether there is any Istio-enabled egress communication from Kubeflow pods that we haven't considered. As far as we know:
- Communication to Notebook and Pipeline backends goes through the ingress gateway rather than directly inside the cluster, so it is not affected.
- Communication to KServe models goes through the cluster ingress gateway.
- All other CRD-based workloads don't need any egress communication.
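As a rough back-of-the-envelope check of the numbers above, the sketch below derives the sidecar count implied by "almost 1 GB each" and "over 2 TB in total", then estimates the savings for a pruned footprint. The pruned per-sidecar value is a purely hypothetical placeholder, not a measurement from our cluster, and the function names are only illustrative.

```python
GB_PER_TB = 1000  # decimal units, matching the GB/TB figures quoted above


def implied_sidecars(total_tb: float, gb_per_sidecar: float) -> int:
    """Number of sidecars implied by a fleet total and a per-sidecar footprint."""
    return round(total_tb * GB_PER_TB / gb_per_sidecar)


def fleet_memory_tb(sidecars: int, gb_per_sidecar: float) -> float:
    """Total sidecar memory across the fleet, in TB."""
    return sidecars * gb_per_sidecar / GB_PER_TB


# Figures from the description: ~1 GB per sidecar, ~2 TB in total.
sidecars = implied_sidecars(total_tb=2.0, gb_per_sidecar=1.0)

# HYPOTHETICAL footprint after pruning; replace with a measured value.
HYPOTHETICAL_PRUNED_GB = 0.1

before_tb = fleet_memory_tb(sidecars, 1.0)
after_tb = fleet_memory_tb(sidecars, HYPOTHETICAL_PRUNED_GB)
print(f"sidecars: {sidecars}, before: {before_tb:.2f} TB, after: {after_tb:.2f} TB")
```

The actual savings depend on how many services remain visible to each sidecar after pruning, so the per-sidecar value should be measured before and after the change.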
🐛 Related Issues
https://github.com/knative/serving/issues/12917 We are hitting this issue: each sidecar keeps querying DNS to resolve the cluster ingress gateway IP, essentially DDoSing our DNS. Removing the ExternalName service for the cluster ingress gateway from the sidecars' view would resolve this problem.
✅ Contributor Checklist
- [x] I have tested these changes with kustomize. See Installation Prerequisites.
- [x] All commits are signed-off to satisfy the DCO check.
Slack message link: https://cloud-native.slack.com/archives/C073W572LA2/p1741893411623659
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: (none)
Once this PR has been reviewed and has the lgtm label, please assign juliusvonkohout for approval. For more information see the Kubernetes Code Review Process.
Thank you for the PR.
/ok-to-test
@tarekabouzeid do you mind testing this?
@madmecodes
So we've actually been running this in production for quite a while. The only caveat is that the notebook controller cannot talk directly to notebooks in user namespaces to update kernel activity. We fixed this by adding another Sidecar resource that applies only to the notebook controller and allows it to cache services in all namespaces.
Do you mind providing implementation details?
Sure.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: istio-sidecar-prune-egress
  namespace: istio-system
spec:
  egress:
  - hosts:
    - ./*
    - ingress-nginx/*
    - ingress-nginx-serving/*
    - istio-system/*
    - kubeflow/*
    - kube-system/*
    - argocd/*
    - cert-manager/*
    - external-dns/*
    - monitoring/*
and
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: istio-sidecar-prune-egress
  namespace: kubeflow
spec:
  egress:
  - hosts:
    - '*/*'
  workloadSelector:
    labels:
      app: notebook-controller
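To make the effect of these two resources easier to reason about, here is a simplified Python model of how egress `hosts` entries of the form `namespace/dnsName` narrow the set of services a sidecar sees. It is only a sketch, not Istio's actual matching code: it assumes `.` means the workload's own namespace and `*` means any namespace, and it ignores `~`, ports, and `exportTo`. The function names and the sample services (`user-a`, `user-b`, and their hostnames) are hypothetical.

```python
def host_matches(pattern: str, workload_ns: str, svc_ns: str, svc_host: str) -> bool:
    """Return True if a 'namespace/dnsName' pattern selects the given service."""
    ns_pat, _, dns_pat = pattern.partition("/")

    # Namespace part: '.' is the workload's own namespace, '*' is any namespace.
    if ns_pat == ".":
        ns_ok = svc_ns == workload_ns
    elif ns_pat == "*":
        ns_ok = True
    else:
        ns_ok = svc_ns == ns_pat

    # DNS part: '*' matches everything, '*.suffix' matches by suffix.
    if dns_pat == "*":
        dns_ok = True
    elif dns_pat.startswith("*."):
        dns_ok = svc_host.endswith(dns_pat[1:])
    else:
        dns_ok = svc_host == dns_pat

    return ns_ok and dns_ok


def visible_services(patterns, workload_ns, services):
    """Filter (namespace, host) pairs down to those selected by any pattern."""
    return [
        (ns, host)
        for ns, host in services
        if any(host_matches(p, workload_ns, ns, host) for p in patterns)
    ]


# Subset of the default hosts from the istio-system Sidecar above.
DEFAULT_HOSTS = ["./*", "istio-system/*", "kubeflow/*", "kube-system/*"]

# Illustrative services only; the names are made up for this example.
SERVICES = [
    ("user-a", "notebook-1.user-a.svc.cluster.local"),
    ("user-b", "notebook-2.user-b.svc.cluster.local"),
    ("kubeflow", "ml-pipeline.kubeflow.svc.cluster.local"),
    ("istio-system", "istiod.istio-system.svc.cluster.local"),
]

# A workload in user-a no longer sees services that live in user-b.
print(visible_services(DEFAULT_HOSTS, "user-a", SERVICES))

# The notebook-controller override ('*/*') keeps every service visible.
print(visible_services(["*/*"], "kubeflow", SERVICES))
```

Under this model, a pod in one user namespace drops services from every other user namespace, while the `workloadSelector` override leaves the notebook controller with the full service list.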
CC also @kimwnasptd @mvlassis since this could help with the performance as well. I am too busy right now, but will follow up.
closed in favor of https://github.com/kubeflow/manifests/pull/3206