Proposal: add UnhealthyNodeNames field to HyperNode status and common HyperNodeConditionType
What is the problem you're trying to solve
- HyperNode functions similarly to a switch or ToR (top-of-rack) switch, requiring switch vendors to update both the `Spec` and `Status` fields on the HyperNode. Currently, the `Status` field only contains `Conditions` and `NodeCount`.
- When an RDMA network card issue occurs on a node connected to the leaf switch or ToR, such as high BER (bit error rate) or link flapping (as discussed in the paper "RDMA over Ethernet for Distributed Training at Meta Scale"), vendors can only report the network card status in the `Conditions` field.
- The `Conditions` field reflects the overall status of the HyperNode, which may be connected to multiple nodes. However, there is no existing mechanism to indicate the health status of individual nodes. This lack of granularity prevents the scheduler from accurately identifying and handling unhealthy nodes.
- Since HyperNode essentially functions as a switch or ToR, it is also necessary to introduce standard switch condition types for vendors to use in `HyperNodeStatus.Conditions`.
Describe the solution you'd like
- Introduce a new field under `Status`, such as `UnhealthyNodeNames`, to explicitly list nodes that are currently unschedulable under the given HyperNode.
- Define two common switch condition types to standardize switch health reporting:
  - `HyperNodeSystemFailure`: Indicates a system-level issue on the switch or ToR, such as CPU or memory overload, power failure, fan malfunction, or other critical system faults.
  - `HyperNodeNetworkUnavailable`: Indicates a network-related issue on the switch or ToR, such as abnormal link status, interface failures, or other network disruptions.
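As a rough illustration of the proposal, the new status field and condition types might look like the following in Go. This is only a sketch under stated assumptions: `Condition` is a simplified local stand-in for the Kubernetes `metav1.Condition` type, the string values of the constants are taken from the example manifest in this proposal, and `ActiveFaults` is a hypothetical helper that is not part of the Volcano API.

```go
package main

import "fmt"

// HyperNodeConditionType names a condition reported on a HyperNode.
type HyperNodeConditionType string

const (
	// HyperNodeSystemFailure marks a system-level fault on the switch
	// (CPU or memory overload, power failure, fan malfunction, ...).
	// The string value is an assumption based on the example manifest.
	HyperNodeSystemFailure HyperNodeConditionType = "SystemFailure"
	// HyperNodeNetworkUnavailable marks a network-related fault on the switch
	// (abnormal link status, interface failures, ...).
	HyperNodeNetworkUnavailable HyperNodeConditionType = "NetworkUnavailable"
)

// Condition is a simplified stand-in for metav1.Condition (assumption).
type Condition struct {
	Type    HyperNodeConditionType `json:"type"`
	Status  string                 `json:"status"`
	Reason  string                 `json:"reason,omitempty"`
	Message string                 `json:"message,omitempty"`
}

// HyperNodeStatus sketches the status with the proposed field added.
type HyperNodeStatus struct {
	Conditions []Condition `json:"conditions,omitempty"`
	NodeCount  int64       `json:"nodeCount,omitempty"`
	// UnhealthyNodeNames lists member nodes that are currently unschedulable.
	UnhealthyNodeNames []string `json:"unhealthyNodeNames,omitempty"`
}

// ActiveFaults is a hypothetical helper that returns the types of all
// conditions whose status is "True".
func ActiveFaults(s HyperNodeStatus) []HyperNodeConditionType {
	var faults []HyperNodeConditionType
	for _, c := range s.Conditions {
		if c.Status == "True" {
			faults = append(faults, c.Type)
		}
	}
	return faults
}

func main() {
	status := HyperNodeStatus{
		Conditions: []Condition{
			{Type: HyperNodeNetworkUnavailable, Status: "True"},
			{Type: HyperNodeSystemFailure, Status: "False"},
		},
		NodeCount:          3,
		UnhealthyNodeNames: []string{"worker-28", "worker-29"},
	}
	fmt.Println(ActiveFaults(status), status.UnhealthyNodeNames)
}
```

Keeping `UnhealthyNodeNames` as a plain string list leaves the switch-wide conditions unchanged while still giving the scheduler per-node information.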
Expected HyperNode Structure
The final HyperNode status should incorporate these enhancements to provide granular node-level health insights and common switch condition types, improving scheduler awareness and resource allocation efficiency.
The final expected HyperNode is as follows:
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  creationTimestamp: "2025-02-05T09:35:50Z"
  generation: 2
  name: leaf1
  resourceVersion: "341389665"
  uid: 0be6f513-0c58-4845-97e9-7da84f04a4d4
spec:
  members:
  - selector:
      exactMatch:
        name: worker-28
    type: Node
  - selector:
      exactMatch:
        name: worker-29
    type: Node
  - selector:
      exactMatch:
        name: worker-30
    type: Node
  tier: 1
status:
  conditions:
  - lastTransitionTime: "2025-02-10T07:41:38Z"
    message: There are network-related problems with the switch
    reason: OPTICAL_LINK_SUBHEALTH_FourHundredGigE1_0_9
    status: "True"
    type: NetworkUnavailable
  - lastTransitionTime: "2025-02-10T07:41:38Z"
    message: The switch is healthy
    reason: SwitchSystemIsHealthy
    status: "False"
    type: SystemFailure
  unhealthyNodeNames:
  - worker-28
  - worker-29
  nodeCount: 3
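One way a scheduler could consume this status is to drop the listed nodes from the HyperNode's members before placing pods. The sketch below is illustrative only: `SchedulableMembers` is a hypothetical helper, not an existing Volcano function, and it assumes member names have already been resolved from the `exactMatch` selectors.

```go
package main

import "fmt"

// SchedulableMembers returns the members that are not listed as unhealthy,
// preserving the original member order.
func SchedulableMembers(members, unhealthy []string) []string {
	// Build a set of unhealthy names for constant-time lookups.
	bad := make(map[string]struct{}, len(unhealthy))
	for _, name := range unhealthy {
		bad[name] = struct{}{}
	}

	healthy := make([]string, 0, len(members))
	for _, name := range members {
		if _, found := bad[name]; !found {
			healthy = append(healthy, name)
		}
	}
	return healthy
}

func main() {
	// Values taken from the leaf1 example.
	members := []string{"worker-28", "worker-29", "worker-30"}
	unhealthy := []string{"worker-28", "worker-29"}

	fmt.Println(SchedulableMembers(members, unhealthy)) // prints [worker-30]
}
```

With the example values, only `worker-30` remains a candidate, even though `nodeCount` still reports three members.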
Welcome @fishingfly!
It looks like this is your first PR to volcano-sh/apis.
Thank you, and welcome to Volcano. :smiley:
Hi @Monokaix, as we discussed offline on Tuesday, this is the initial proposal.
/lgtm