training-operator
"zero-trust" security / networking for training jobs
What you would like to be added?
Secure, ideally by default, the data plane of the jobs managed by the training operator.
This would include:
- The creation of NetworkPolicies that prevent ingress traffic to the training jobs, i.e., only intra-job Pod-to-Pod communication is allowed (see the sketch after this list)
- The configuration of (m)TLS for Pod-to-Pod communication wherever possible, or documentation on how to achieve it, possibly using an external solution such as a service mesh
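As a rough illustration of the first point, a NetworkPolicy along these lines could deny all ingress to a job's Pods except traffic from Pods of the same job. This is only a sketch: the `example.com/job-name` label is a placeholder for whatever label the operator actually applies to the job's Pods.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: my-train-job-intra-job-only
  namespace: team-a
spec:
  # Select every Pod that belongs to this training job.
  podSelector:
    matchLabels:
      example.com/job-name: my-train-job   # placeholder label
  policyTypes:
    - Ingress
  ingress:
    # Allow ingress only from Pods of the same job in the same namespace;
    # all other ingress to the selected Pods is denied.
    - from:
        - podSelector:
            matchLabels:
              example.com/job-name: my-train-job   # placeholder label
```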
Why is this needed?
In multi-tenant setups, it's important to guarantee tenants are isolated from each other.
Love this feature?
Give it a 👍 We prioritize the features with the most 👍
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Thank you for creating this @astefanutti! Creating a dedicated page for Kubeflow Trainer operators would be nice: https://www.kubeflow.org/docs/components/trainer/operator-guides/
/area docs
/remove-label lifecycle/needs-triage
/remove-lifecycle stale
cc @kubeflow/wg-manifests-leads @juliusvonkohout from the security point of view.
I think Istio support with mTLS for the trainer component would already cover your needs.
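For reference, if Istio sidecars were injected into the job's namespace, a namespace-wide PeerAuthentication in STRICT mode would be one way to enforce mTLS for Pod-to-Pod traffic. A minimal sketch, assuming Istio is installed; the namespace name is a placeholder:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: team-a   # placeholder namespace
spec:
  mtls:
    # STRICT rejects any plain-text traffic between sidecar-injected Pods.
    mode: STRICT
```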
I wonder if Istio would support direct GPU interconnect and other high-performance network fabric.
Are you sure that this even runs through the Kubernetes network stack? Shouldn't this happen at lower levels?
In my experience, this is quite challenging. HPC clusters often directly use SR-IOV Virtual Functions exposed by physical interconnect devices.
You're right, that's not always the case. With RoCE, collective communication still flows over Ethernet, possibly on secondary network interfaces.
Having said that, I'd be curious to see some data on how relevant encryption is for collective communication, and whether the performance hit is simply too high compared to the benefits.
I know NVIDIA Spectrum switches support offloading collective communication. That might make the communication secure while improving collective performance, but I have not actually verified this behavior.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
So on-demand NetworkPolicies that reference the job as owner will be automatically garbage collected with the job. That should be good enough, and also usable in stand-alone mode, i.e. without the Kubeflow platform. If you want encryption, then Istio and authorization policies within the Kubeflow platform are probably necessary, but that should be opt-in.
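For illustration, the ownership link could look like the snippet below: the operator sets an ownerReference from the NetworkPolicy to the TrainJob, so Kubernetes garbage-collects the policy together with the job. Names and the UID are placeholders, and the policy body is elided.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: my-train-job-network-policy
  namespace: team-a
  ownerReferences:
    # Owning the policy via the TrainJob means Kubernetes deletes it
    # automatically when the TrainJob is deleted.
    - apiVersion: trainer.kubeflow.org/v1alpha1
      kind: TrainJob
      name: my-train-job
      uid: 00000000-0000-0000-0000-000000000000  # placeholder UID
      controller: true
      blockOwnerDeletion: true
spec:
  podSelector: {}  # policy body elided; see the sketch earlier in the thread
  policyTypes:
    - Ingress
```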
I agree. NetworkPolicies are being discussed by the AI Conformance WG as a requirement for CNCF Kubernetes AI conformance.
Would it make sense to provide an option in the trainer so that a NetworkPolicy (owned by the parent TrainJob) is automatically created when enabled?
By default that NetworkPolicy would only allow ingress from intra-job nodes/Pods.
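Purely as a sketch of such an opt-in, the API shape could look like this; the networkPolicy field is hypothetical and does not exist in the TrainJob API today:

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: my-train-job
spec:
  runtimeRef:
    name: torch-distributed
  networkPolicy:   # hypothetical field, not part of the current API
    enabled: true  # would create an intra-job-only ingress policy
```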
Not just an option, this should be enforced by default.
I agree, +1 to this.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
/good-first-issue
@andreyvelich: This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.
In response to this:
/good-first-issue
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I would like to take on this issue
/assign
If there is no PR yet, just raise one and work with the trainer maintainers here and on Slack.
I have a clear idea for applying ingress policies and Istio sidecars.
However, the main issue raised here is that HPC jobs interact with hardware directly (such as SR-IOV), where Pods communicate with physical NICs and bypass the CNI. This means isolation would need to rely on switch configurations, VLANs, and ACLs, e.g. using NVIDIA Spectrum, as mentioned in this comment.
Let me know your opinions; I'd like to start a trial with Spectrum based on that.
Just providing NetworkPolicies owned by a higher-level resource (for automatic deletion) as a first step should be enough. Afterwards you can think about Istio.
+1 to what @juliusvonkohout suggests. NetworkPolicies that only allow intra-TrainJob Pod-to-Pod communication, securing the PyTorch distributed runtime, would be enough. We can iterate further to cover more advanced use cases.