Limit Istio Sidecar Scope to reduce memory and make cluster more scalable

Open han-steve opened this issue 9 months ago • 8 comments

Pull Request Template for Kubeflow Manifests

✏️ Summary of Changes

Describe the changes you have made, including any refactoring or feature additions.

Adding a new Istio Sidecar resource that limits each sidecar's egress visibility, so that it no longer sees services it does not need.

We (Roblox) have been running Kubeflow in production for a long time, and we are noticing that each Istio sidecar now uses almost 1 GB of memory because of the number of services in the cluster that have to be cached in every sidecar. This adds up to over 2 TB of memory in total. This change limits the caching of cluster services in each sidecar, which helps the cluster scale.

This change can save TBs of memory and spare our DNS services. However, I want to ask the community whether there is any Istio-enabled egress communication from Kubeflow pods that we haven't considered. As far as we know:

- Communication with Notebook and Pipeline backends goes through the ingress gateway instead of directly inside the cluster, so it is not affected.
- Communication with KServe models goes through the cluster ingress gateway.
- All other CRD-based workloads don't need any egress communication.

🐛 Related Issues

Link any issues that are resolved or affected by this PR.

https://github.com/knative/serving/issues/12917

We are facing this issue, where each sidecar queries DNS to resolve the cluster ingress gateway IP, essentially DDoSing our DNS. Removing the ExternalName service for the cluster ingress gateway from the sidecars would resolve this problem.
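For readers unfamiliar with the pattern, an ExternalName Service of the kind described above might look roughly like the sketch below. This is an illustration only: the Service name, the namespace, and the gateway hostname are assumptions and are not taken from this PR or the linked issue.

```yaml
# Hypothetical example; every name below is illustrative.
apiVersion: v1
kind: Service
metadata:
  name: my-model             # hypothetical service name
  namespace: user-namespace  # hypothetical user namespace
spec:
  type: ExternalName
  # Assumed gateway hostname; the real value depends on the installation.
  externalName: knative-local-gateway.istio-system.svc.cluster.local
```

If a sidecar's egress scope does not include the namespace that holds such a Service, the sidecar should no longer receive configuration for it, which is the effect the PR is after.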

✅ Contributor Checklist


You can join the CNCF Slack and access our meetings at the Kubeflow Community website. Our channel on the CNCF Slack is here #kubeflow-platform.

Slack message link: https://cloud-native.slack.com/archives/C073W572LA2/p1741893411623659

han-steve avatar Mar 14 '25 18:03 han-steve

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign juliusvonkohout for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

google-oss-prow[bot] avatar Mar 14 '25 18:03 google-oss-prow[bot]

Thank you for the PR.

/ok-to-test

juliusvonkohout avatar Mar 14 '25 18:03 juliusvonkohout

@tarekabouzeid do you mind testing this?

juliusvonkohout avatar Mar 20 '25 08:03 juliusvonkohout

@madmecodes

juliusvonkohout avatar May 29 '25 17:05 juliusvonkohout

So we've actually been running this in production for quite a while. The only caveat is that the notebook controller cannot directly talk to notebooks in user namespaces to update the kernel activity. We fixed this by adding another Sidecar resource that applies only to the notebook controller and allows it to cache services in all namespaces.

han-steve avatar May 29 '25 19:05 han-steve

> So we've actually been running this in production for quite a while. The only caveat is that the notebook controller cannot directly talk to notebooks in user namespaces to update the kernel activity. We fixed this by adding another Sidecar resource that applies only to the notebook controller and allows it to cache services in all namespaces.

Do you mind providing implementation details?

juliusvonkohout avatar Jun 04 '25 09:06 juliusvonkohout

Sure.

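# No workloadSelector and placed in the Istio root namespace
# (istio-system by default), so this acts as the mesh-wide default.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: istio-sidecar-prune-egress
  namespace: istio-system
spec:
  egress:
  - hosts:
    # Services in the workload's own namespace.
    - ./*
    # Shared namespaces that every sidecar may still reach.
    - ingress-nginx/*
    - ingress-nginx-serving/*
    - istio-system/*
    - kubeflow/*
    - kube-system/*
    - argocd/*
    - cert-manager/*
    - external-dns/*
    - monitoring/*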

and

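# Exception for the notebook controller: it needs to reach notebooks
# in user namespaces, so it keeps visibility of all services.
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: istio-sidecar-prune-egress
  namespace: kubeflow
spec:
  egress:
  - hosts:
    # All services in all namespaces.
    - '*/*'
  # Applies only to pods in the kubeflow namespace carrying this label.
  workloadSelector:
    labels:
      app: notebook-controller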

han-steve avatar Jun 10 '25 01:06 han-steve

CC also @kimwnasptd @mvlassis since this could help with the performance as well. I am too busy right now, but will follow up.

juliusvonkohout avatar Jun 13 '25 10:06 juliusvonkohout

closed in favor of https://github.com/kubeflow/manifests/pull/3206

juliusvonkohout avatar Aug 05 '25 14:08 juliusvonkohout