
"zero-trust" security / networking for training jobs

astefanutti opened this issue 11 months ago

What would you like to be added?

Secure, ideally by default, the data plane of the jobs managed by the training operator.

This would include:

  • The creation of NetworkPolicies that prevent ingress traffic to the training jobs, i.e., only intra-job Pod-to-Pod communication is allowed (see the sketch after this list)
  • The configuration of (m)TLS for Pod-to-Pod communication wherever possible, or documentation on how to achieve it, possibly using an external solution such as a service mesh.
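
As a rough sketch of the first bullet, the operator could create something along these lines per job. The `training.kubeflow.org/job-name` label key is only an assumption for illustration; the operator would reuse whatever labels it already applies to the job's Pods:

```go
package policy

import (
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// intraJobNetworkPolicy denies all ingress to the job's Pods except traffic
// originating from Pods of the same job. The label key below is hypothetical.
func intraJobNetworkPolicy(namespace, jobName string) *networkingv1.NetworkPolicy {
	jobSelector := metav1.LabelSelector{
		MatchLabels: map[string]string{
			"training.kubeflow.org/job-name": jobName, // assumption
		},
	}
	return &networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{
			Name:      jobName + "-intra-job",
			Namespace: namespace,
		},
		Spec: networkingv1.NetworkPolicySpec{
			// The policy targets every Pod belonging to the job.
			PodSelector: jobSelector,
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
			// Ingress is allowed only from Pods matching the same job selector;
			// any other ingress is dropped once the policy selects the Pods.
			Ingress: []networkingv1.NetworkPolicyIngressRule{{
				From: []networkingv1.NetworkPolicyPeer{{
					PodSelector: &jobSelector,
				}},
			}},
		},
	}
}
```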

Why is this needed?

In multi-tenant setups, it's important to guarantee tenants are isolated from each other.

Love this feature?

Give it a 👍 We prioritize the features with the most 👍

astefanutti avatar Nov 29 '24 13:11 astefanutti

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Feb 27 '25 15:02 github-actions[bot]

Thank you for creating this @astefanutti! Creating a dedicated page for Kubeflow Trainer operators would be nice: https://www.kubeflow.org/docs/components/trainer/operator-guides/

/area docs
/remove-label lifecycle/needs-triage
/remove-lifecycle stale

andreyvelich avatar Feb 27 '25 15:02 andreyvelich

cc @kubeflow/wg-manifests-leads @juliusvonkohout from the security point of view.

andreyvelich avatar Feb 27 '25 15:02 andreyvelich

I think Istio support with mTLS for the trainer component would already cover your needs.

juliusvonkohout avatar Feb 27 '25 16:02 juliusvonkohout

> I think Istio support with mTLS for the trainer component would already cover your needs.

I wonder if Istio would support direct GPU interconnect and other high-performance network fabric.

astefanutti avatar Feb 27 '25 17:02 astefanutti

> I think Istio support with mTLS for the trainer component would already cover your needs.

> I wonder if Istio would support direct GPU interconnect and other high-performance network fabric.

Are you sure that this even runs through the Kubernetes network stack? Shouldn't this happen at lower levels?

juliusvonkohout avatar Feb 27 '25 17:02 juliusvonkohout

> I think Istio support with mTLS for the trainer component would already cover your needs.

> I wonder if Istio would support direct GPU interconnect and other high-performance network fabric.

In my experience, this is quite challenging. HPC clusters often directly use SR-IOV Virtual Functions generated by the physical interconnect devices.

tenzen-y avatar Feb 27 '25 17:02 tenzen-y

> I think Istio support with mTLS for the trainer component would already cover your needs.

> I wonder if Istio would support direct GPU interconnect and other high-performance network fabric.

> Are you sure that this even runs through the Kubernetes network stack? Shouldn't this happen at lower levels?

You're right, that's not always the case. With RoCE, collective communication still flows over Ethernet, possibly on secondary network interfaces.

Having said that, I'd be curious to have some data on how relevant encryption is for collective communication, and whether the performance hit isn't just too high compared to the benefits.

astefanutti avatar Feb 27 '25 17:02 astefanutti

> I think Istio support with mTLS for the trainer component would already cover your needs.

> I wonder if Istio would support direct GPU interconnect and other high-performance network fabric.

> Are you sure that this even runs through the Kubernetes network stack? Shouldn't this happen at lower levels?

> You're right, that's not always the case. With RoCE, collective communication still flows over Ethernet, possibly on secondary network interfaces.

> Having said that, I'd be curious to have some data on how relevant encryption is for collective communication, and whether the performance hit isn't just too high compared to the benefits.

I know that the NVIDIA Spectrum switch supports offloading collective communication. It might be able to both secure communications and improve collective communication performance, but I have not actually verified that behavior.

tenzen-y avatar Feb 27 '25 19:02 tenzen-y

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar May 28 '25 20:05 github-actions[bot]

/remove-lifecycle stale

andreyvelich avatar May 28 '25 20:05 andreyvelich

So on-demand NetworkPolicies that reference the job as owner will be automatically garbage collected with the job. That should be good enough and also usable in stand-alone mode, i.e., without the Kubeflow Platform. If you want encryption, then Istio and AuthorizationPolicies within the Kubeflow Platform are probably necessary, but that should be opt-in.
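
For illustration, the garbage-collection part is just an owner reference from the NetworkPolicy to the job. The group/version/kind values below are assumptions, not a committed API, and a real controller would typically use controller-runtime's `controllerutil.SetControllerReference` instead:

```go
package policy

import (
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// ownPolicy marks the job as the controlling owner of the NetworkPolicy, so
// Kubernetes garbage collection deletes the policy together with the job.
func ownPolicy(np *networkingv1.NetworkPolicy, jobName string, jobUID types.UID) {
	controller := true
	blockOwnerDeletion := true
	np.OwnerReferences = []metav1.OwnerReference{{
		// APIVersion and Kind are assumptions here; they must match the
		// actual owning resource (e.g. a TrainJob) in the real controller.
		APIVersion:         "trainer.kubeflow.org/v1alpha1",
		Kind:               "TrainJob",
		Name:               jobName,
		UID:                jobUID,
		Controller:         &controller,
		BlockOwnerDeletion: &blockOwnerDeletion,
	}}
}
```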

juliusvonkohout avatar May 29 '25 16:05 juliusvonkohout

> So on-demand NetworkPolicies that reference the job as owner will be automatically garbage collected with the job. That should be good enough

I agree. NetworkPolicies are being discussed by the AI Conformance WG as a requirement for CNCF Kubernetes AI conformance.

Would it make sense to provide an option in the trainer so that a NetworkPolicy (owned by the parent TrainJob) is automatically created when enabled?

By default, that NetworkPolicy would only allow ingress from intra-job nodes/pods.
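
A minimal sketch of such a gate in the reconciler, with all names hypothetical (whether it defaults to on or off is a separate decision):

```go
package policy

// enableAnnotation is a hypothetical per-job override; defaultEnabled would be
// an operator-level configuration flag. Neither name exists today.
const enableAnnotation = "trainer.kubeflow.org/enable-network-policy"

// shouldCreatePolicy gates NetworkPolicy creation: the operator-level default
// applies unless the job overrides it via the annotation.
func shouldCreatePolicy(defaultEnabled bool, annotations map[string]string) bool {
	if v, ok := annotations[enableAnnotation]; ok {
		return v == "true"
	}
	return defaultEnabled
}
```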

astefanutti avatar Aug 07 '25 09:08 astefanutti

> So on-demand NetworkPolicies that reference the job as owner will be automatically garbage collected with the job. That should be good enough

> I agree. NetworkPolicies are being discussed by the AI Conformance WG as a requirement for CNCF Kubernetes AI conformance.

> Would it make sense to provide an option in the trainer so that a NetworkPolicy (owned by the parent TrainJob) is automatically created when enabled?

> By default, that NetworkPolicy would only allow ingress from intra-job nodes/pods.

Not just an option, this should be enforced by default.

juliusvonkohout avatar Aug 07 '25 10:08 juliusvonkohout

> Not just an option, this should be enforced by default.

I agree, +1 to this.

astefanutti avatar Aug 07 '25 12:08 astefanutti

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 05 '25 15:11 github-actions[bot]

/remove-lifecycle stale

astefanutti avatar Nov 05 '25 15:11 astefanutti

/good-first-issue

andreyvelich avatar Nov 05 '25 15:11 andreyvelich

@andreyvelich: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to this:

> /good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Nov 05 '25 15:11 google-oss-prow[bot]

I would like to take on this issue

Garvit-77 avatar Nov 05 '25 18:11 Garvit-77

/assign

Garvit-77 avatar Nov 05 '25 18:11 Garvit-77

> I would like to take on this issue

If there is no PR yet, just raise one and work with the trainer maintainers here and on Slack.

juliusvonkohout avatar Nov 10 '25 14:11 juliusvonkohout

I have a clear idea for applying ingress policies and Istio sidecars.

However, the main issue raised here is that during HPC jobs, the system interacts with hardware (such as SR-IOV), which directly communicates with physical NICs and bypasses the CNI. This means isolation would need to rely on switch configurations, VLANs, and ACLs using NVIDIA Spectrum, as mentioned in this comment.

Let me know your opinions—I'd like to start a trial with Spectrum based on that.

Garvit-77 avatar Nov 11 '25 11:11 Garvit-77

> I have a clear idea for applying ingress policies and Istio sidecars.

> However, the main issue raised here is that during HPC jobs, the system interacts with hardware (such as SR-IOV), which directly communicates with physical NICs and bypasses the CNI. This means isolation would need to rely on switch configurations, VLANs, and ACLs using NVIDIA Spectrum, as mentioned in this comment.

> Let me know your opinions—I'd like to start a trial with Spectrum based on that.

Just providing NetworkPolicies owned by a higher-level resource (for automatic deletion) as a first step should be enough. Afterwards you can think about Istio.

juliusvonkohout avatar Nov 13 '25 17:11 juliusvonkohout

> I have a clear idea for applying ingress policies and Istio sidecars. However, the main issue raised here is that during HPC jobs, the system interacts with hardware (such as SR-IOV), which directly communicates with physical NICs and bypasses the CNI. This means isolation would need to rely on switch configurations, VLANs, and ACLs using NVIDIA Spectrum, as mentioned in this comment. Let me know your opinions—I'd like to start a trial with Spectrum based on that.

> Just providing NetworkPolicies owned by a higher-level resource as a first step should be enough. Afterwards you can think about Istio.

+1 to what @juliusvonkohout suggests. NetworkPolicies that only allow intra-TrainJob Pod-to-Pod communication, to secure the PyTorch distributed runtime, would be enough. We can iterate further to cover more advanced use cases.

astefanutti avatar Nov 13 '25 18:11 astefanutti