Check instance reachable status in machine-controller-manager while checking new machine joining machine deployment

Open neo-liang-sap opened this issue 3 years ago • 3 comments

How to categorize this issue? /area control-plane /kind enhancement /priority 3

What would you like to be added:

In AWS, an instance is sometimes running but not reachable. The AWS CLI has a command that reports this reachability status:

aws ec2 describe-instance-status --instance-ids i-01e71990bfe658adc
{
    "InstanceStatuses": [
        {
            "AvailabilityZone": "eu-central-1a",
            "InstanceId": "i-01e71990bfe658adc",
            "InstanceState": {
                "Code": 16,
                "Name": "running"
            },
            "InstanceStatus": {
                "Details": [
                    {
                        "ImpairedSince": "2022-06-21T06:28:00+00:00",
                        "Name": "reachability",
                        "Status": "failed"
                    }
                ],
                "Status": "impaired"
            },
            "SystemStatus": {
                "Details": [
                    {
                        "Name": "reachability",
                        "Status": "passed"
                    }
                ],
                "Status": "ok"
            }
        }
    ]
}

This instance is running but not reachable: its instance status reachability check has failed.

Is it possible to add a check in MCM for whether the instance is reachable?
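
To illustrate what such a check might look at, here is a minimal Go sketch that parses output shaped like the JSON above and flags an instance that is running but has not passed its reachability checks. It is not MCM code: the type and function names (describeOutput, parseStatuses, isRunningButUnreachable) are hypothetical, and treating every status other than "passed" as unreachable is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These types mirror only the fields of the describe-instance-status
// output quoted in this issue. The names are illustrative.
type statusDetail struct {
	Name   string `json:"Name"`
	Status string `json:"Status"`
}

type statusSummary struct {
	Details []statusDetail `json:"Details"`
	Status  string         `json:"Status"`
}

type instanceStatus struct {
	InstanceID    string `json:"InstanceId"`
	InstanceState struct {
		Code int    `json:"Code"`
		Name string `json:"Name"`
	} `json:"InstanceState"`
	InstanceStatus statusSummary `json:"InstanceStatus"`
	SystemStatus   statusSummary `json:"SystemStatus"`
}

type describeOutput struct {
	InstanceStatuses []instanceStatus `json:"InstanceStatuses"`
}

// parseStatuses decodes the JSON printed by the CLI command.
func parseStatuses(raw []byte) (describeOutput, error) {
	var out describeOutput
	err := json.Unmarshal(raw, &out)
	return out, err
}

// reachabilityPassed reports whether the summary has a "reachability"
// detail with status "passed". Assumption: any other status, or a
// missing detail, counts as not passed.
func reachabilityPassed(s statusSummary) bool {
	for _, d := range s.Details {
		if d.Name == "reachability" {
			return d.Status == "passed"
		}
	}
	return false
}

// isRunningButUnreachable is a hypothetical helper: the instance is in
// state "running" but at least one reachability check has not passed.
func isRunningButUnreachable(s instanceStatus) bool {
	return s.InstanceState.Name == "running" &&
		!(reachabilityPassed(s.InstanceStatus) && reachabilityPassed(s.SystemStatus))
}

// sample is a trimmed copy of the output quoted in this issue.
const sample = `{
  "InstanceStatuses": [{
    "InstanceId": "i-01e71990bfe658adc",
    "InstanceState": {"Code": 16, "Name": "running"},
    "InstanceStatus": {
      "Details": [{"Name": "reachability", "Status": "failed"}],
      "Status": "impaired"
    },
    "SystemStatus": {
      "Details": [{"Name": "reachability", "Status": "passed"}],
      "Status": "ok"
    }
  }]
}`

func main() {
	out, err := parseStatuses([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, s := range out.InstanceStatuses {
		fmt.Printf("%s running but unreachable: %v\n", s.InstanceID, isRunningButUnreachable(s))
	}
}
```

A real implementation would presumably query the provider API from the driver instead of parsing CLI output; the sketch only shows which fields carry the signal.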

Why is this needed:

To better understand the process of a machine joining the cluster. For example, sometimes a machine is created and then, after 20 minutes, deleted by MCM and replaced by a new one.

CC @dguendisch

neo-liang-sap avatar Jun 21 '22 07:06 neo-liang-sap

@neo-liang-sap Label area/todo does not exist.

gardener-robot avatar Jun 21 '22 07:06 gardener-robot

Yes, we will work on adding such a feature. Some research is required first to see whether other providers also expose this kind of networking information for an instance directly.

himanshu-kun avatar Jun 27 '22 07:06 himanshu-kun

Post Grooming discussion

We need to enhance the driver method GetMachineStatus to also perform checks such as the reachability check mentioned above, and enhance GetMachineStatusResponse to contain the result of the check. We should then update the error in the machine status to reflect that result, so that it propagates to the status of the higher-level controllers and is shown in the dashboard for the user to see.
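
A rough Go sketch of that idea, under stated assumptions: machineStatusResponse below is a local stand-in and not the real GetMachineStatusResponse type, and the Reachable field, errUnreachable, and statusError are hypothetical names invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// machineStatusResponse is a local stand-in for the driver response.
// It is NOT the real MCM type; the Reachable field is hypothetical.
type machineStatusResponse struct {
	InstanceID string
	NodeName   string
	Reachable  bool // hypothetical: result of the provider reachability check
}

// errUnreachable is a hypothetical sentinel error for this sketch.
var errUnreachable = errors.New("instance is running but failed its reachability check")

// statusError converts the check result into an error that a controller
// could record in the machine status. It returns nil when reachable.
func statusError(resp machineStatusResponse) error {
	if !resp.Reachable {
		return fmt.Errorf("machine %q: %w", resp.NodeName, errUnreachable)
	}
	return nil
}

func main() {
	// Placeholder values; the node name is made up for the example.
	resp := machineStatusResponse{
		InstanceID: "i-01e71990bfe658adc",
		NodeName:   "node-a",
		Reachable:  false,
	}
	if err := statusError(resp); err != nil {
		fmt.Println("error to surface in machine status:", err)
	}
}
```

Wrapping a sentinel error would let higher-level controllers detect this case with errors.Is, though how the result is actually carried in the response is a design decision left open here.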

himanshu-kun avatar Feb 23 '23 11:02 himanshu-kun