longhorn-manager icon indicating copy to clipboard operation
longhorn-manager copied to clipboard

fix: check if environment settings are achieved

Open mantissahz opened this issue 1 year ago • 2 comments

Which issue(s) this PR fixes:

Issue # longhorn/longhorn#7571

What this PR does / why we need it:

Check if the environment settings are achieved, ref: environment_check.sh

Special notes for your reviewer:

Additional documentation or context

mantissahz avatar Apr 29 '24 03:04 mantissahz

The check is enforced when starting the Longhorn Manager daemon. If any condition fails, the Longhorn Manager will enter a loop of Starting and CrashLoopBackOff.

For users who only require RWO volumes, the NFS client is unnecessary. Then, the failure doesn't make sense.

Instead of performing checks only during startup, we can introduce conditions in nodes.longhorn.io to record the status of necessary packages, utilities, and demons. These conditions can be periodically checked by the node controller. This approach will help users stay informed and diagnose the Longhorn system effectively.

WDYT @c3y1huang @shuo-wu @PhanLe1010 @innobead

derekbit avatar Jul 03 '24 02:07 derekbit

Instead of performing checks only during startup, we can introduce conditions in nodes.longhorn.io to record the status of necessary packages, utilities, and demons. These conditions can be periodically checked by the node controller. This approach will help users stay informed and diagnose the Longhorn system effectively.

I agree with this design! We already check the mountProagation and record it inside the node condition:

  status:
    autoEvicting: false
    conditions:
    - lastProbeTime: ""
      lastTransitionTime: "2024-06-13T02:22:19Z"
      message: ""
      reason: ""
      status: "True"
      type: MountPropagation

We can do something similar for this new EnviromentCheck condition

PhanLe1010 avatar Jul 04 '24 01:07 PhanLe1010

Hi @derekbit, @PhanLe1010,

I have updated the PR to use the node conditions to record the environment setting. Could you review the PR when you are available?

mantissahz avatar Jul 08 '24 01:07 mantissahz

cc @ChanYiLin for helping review it if having time. Thank you.

derekbit avatar Jul 08 '24 04:07 derekbit

There is an exsting environment check for open-iscsi in the app/daemon.go. We already have a condition in node.longhorn.io for it. Do we still need to error out if open-iscsi is missing while staring up the daemon?

UPDATE: after discussing with @mantissahz , the existing environmentCheck function will be retained and called in longhorn-manager's subcommand daemon and pre-upgrade, and the function only checks the critical component open-iscsi. For other packages, utilties, and multipathd, they will be periodcally checked in the node controller.

cc @PhanLe1010 @shuo-wu @innobead WDYT?

derekbit avatar Jul 09 '24 02:07 derekbit

Ther kernel check function is not easy to read and maintain. What we need to check is. the four kernels

Distro Broken Version
Vanilla kernel 6.5.6
Ubuntu 5.15.0-94
Ubuntu 6.5.0-21
Ubuntu 6.5.0-1014-aws

Can we check it use string compare?

derekbit avatar Jul 09 '24 12:07 derekbit

As discussed with @mantissahz, we decide to remove kernel version check because

  • The current implementation is not easy to maintain. We need to figure out a better way to do it.
  • We are unable to include all problematic kernel versions in the check.

derekbit avatar Jul 10 '24 02:07 derekbit

In general, LGTM.

derekbit avatar Jul 10 '24 05:07 derekbit