
Proposal: add UnhealthyNodeNames field to HyperNode status and common HyperNodeConditionType values

Open fishingfly opened this issue 9 months ago • 4 comments

What is the problem you're trying to solve

  • A HyperNode functions similarly to a switch or ToR (top-of-rack switch), so switch vendors need to update both the Spec and Status fields of the HyperNode. Currently, the Status field contains only Conditions and NodeCount.
  • When an RDMA network card issue occurs on a node connected to the leaf switch or ToR, such as a high BER (Bit Error Rate) or link flapping (as discussed in the paper "RDMA over Ethernet for Distributed Training at Meta Scale"), vendors can only report the network card status in the Conditions field.
  • The Conditions field reflects the overall status of the HyperNode, which may be connected to multiple nodes. However, there is no existing mechanism to indicate the health status of individual nodes. This lack of granularity prevents the scheduler from accurately identifying and handling unhealthy nodes.
  • Since a HyperNode essentially functions as a switch or ToR, it is also necessary to introduce standard switch condition types for vendors to use in HyperNodeStatus.Conditions.

Describe the solution you'd like

  • Introduce a new field under Status, such as UnhealthyNodeNames, to explicitly list the nodes under the given HyperNode that are currently unhealthy and should be treated as unschedulable.
  • Define two common switch condition types to standardize switch health reporting:
    • HyperNodeSystemFailure: Indicates a system-level issue on the switch or ToR, such as CPU or memory overload, power failure, fan malfunction, or other critical system faults.
    • HyperNodeNetworkUnavailable: Indicates a network-related issue on the switch or ToR, such as abnormal link status, interface failures, or other network disruptions.
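
As a rough illustration of the two points above, the status could be sketched in Go roughly as follows. This is not the actual volcano-sh/apis code: the Condition struct is a simplified local stand-in for a Kubernetes-style condition, the string values of the two constants are an assumption, and HasFault is a hypothetical helper added only to show how the condition types might be consumed.

```go
package main

import "fmt"

// HyperNodeConditionType is a hypothetical string type for condition names.
type HyperNodeConditionType string

const (
	// The constant names follow the proposal. The string values are an
	// assumption, not something the proposal specifies.
	HyperNodeSystemFailure      HyperNodeConditionType = "SystemFailure"
	HyperNodeNetworkUnavailable HyperNodeConditionType = "NetworkUnavailable"
)

// Condition is a simplified local stand-in for a Kubernetes-style condition.
type Condition struct {
	Type    HyperNodeConditionType `json:"type"`
	Status  string                 `json:"status"` // "True", "False" or "Unknown"
	Reason  string                 `json:"reason,omitempty"`
	Message string                 `json:"message,omitempty"`
}

// HyperNodeStatus sketches the status with the proposed field added.
type HyperNodeStatus struct {
	Conditions []Condition `json:"conditions,omitempty"`
	NodeCount  int64       `json:"nodeCount,omitempty"`
	// UnhealthyNodeNames is the proposed per-node health field.
	UnhealthyNodeNames []string `json:"unhealthyNodeNames,omitempty"`
}

// HasFault reports whether either proposed switch condition is "True".
func HasFault(s HyperNodeStatus) bool {
	for _, c := range s.Conditions {
		isSwitchCondition := c.Type == HyperNodeSystemFailure ||
			c.Type == HyperNodeNetworkUnavailable
		if isSwitchCondition && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	status := HyperNodeStatus{
		Conditions: []Condition{
			{Type: HyperNodeNetworkUnavailable, Status: "True"},
			{Type: HyperNodeSystemFailure, Status: "False"},
		},
		NodeCount:          3,
		UnhealthyNodeNames: []string{"worker-28", "worker-29"},
	}
	fmt.Println(HasFault(status))
}
```

Keeping UnhealthyNodeNames separate from Conditions lets a vendor report switch-level faults and node-level faults independently.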

Expected HyperNode Structure

The final HyperNode status should incorporate these enhancements to provide granular node-level health insights and common switch condition types, improving scheduler awareness and resource allocation efficiency.

The final expected HyperNode is as follows:

apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  creationTimestamp: "2025-02-05T09:35:50Z"
  generation: 2
  name: leaf1
  resourceVersion: "341389665"
  uid: 0be6f513-0c58-4845-97e9-7da84f04a4d4
spec:
  members:
  - selector:
      exactMatch:
        name: worker-28
    type: Node
  - selector:
      exactMatch:
        name: worker-29
    type: Node
  - selector:
      exactMatch:
        name: worker-30
    type: Node
  tier: 1
status:
  conditions:
  - lastTransitionTime: "2025-02-10T07:41:38Z"
    message: There are network-related problems with the switch
    reason: OPTICAL_LINK_SUBHEALTH_FourHundredGigE1_0_9
    status: "True"
    type: NetworkUnavailable
  - lastTransitionTime: "2025-02-10T07:41:38Z"
    message: The switch is healthy
    reason: SwitchSystemIsHealthy
    status: "False"
    type: SystemFailure
  unhealthyNodeNames:
  - worker-28
  - worker-29
  nodeCount: 3
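
To make the intended effect on scheduling concrete, here is a minimal Go sketch of how a consumer might use the proposed field. SchedulableNodes is a hypothetical helper, not an existing Volcano function, and it assumes the member node names have already been resolved from spec.members.

```go
package main

import "fmt"

// SchedulableNodes returns the members that are not listed as unhealthy,
// preserving the original member order. Hypothetical helper for illustration.
func SchedulableNodes(members, unhealthy []string) []string {
	// Build a set of unhealthy names for constant-time lookups.
	bad := make(map[string]struct{}, len(unhealthy))
	for _, name := range unhealthy {
		bad[name] = struct{}{}
	}

	ok := make([]string, 0, len(members))
	for _, name := range members {
		if _, found := bad[name]; !found {
			ok = append(ok, name)
		}
	}
	return ok
}

func main() {
	// Values taken from the example HyperNode "leaf1".
	members := []string{"worker-28", "worker-29", "worker-30"}
	unhealthy := []string{"worker-28", "worker-29"}

	fmt.Println(SchedulableNodes(members, unhealthy)) // prints [worker-30]
}
```

With the example status, only worker-30 remains a candidate even though the HyperNode as a whole still reports three nodes in nodeCount.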

fishingfly avatar Feb 13 '25 06:02 fishingfly

Welcome @fishingfly!

It looks like this is your first PR to volcano-sh/apis.

Thank you, and welcome to Volcano. :smiley:

volcano-sh-bot avatar Feb 13 '25 06:02 volcano-sh-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign kevin-wangzefeng.
You can assign the PR to them by writing /assign @kevin-wangzefeng in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

volcano-sh-bot avatar Feb 13 '25 06:02 volcano-sh-bot

Hi @Monokaix, as we discussed offline on Tuesday, this is the initial proposal.

yeahdongcn avatar Feb 13 '25 06:02 yeahdongcn

/lgtm

Monokaix avatar Mar 26 '25 01:03 Monokaix