awsome-distributed-training

SageMaker HyperPod "Target not connected"

Open sean-smith opened this issue 1 year ago • 4 comments

If you're trying to connect to your SageMaker HyperPod cluster and you see the error "An error occurred (TargetNotConnected)", there are a couple of common causes. The full error looks like this:

An error occurred (TargetNotConnected) when calling the StartSession operation: sagemaker-cluster:..._controller-machine-i-... is not connected.
kex_exchange_identification: Connection closed by remote host
Connection closed by UNKNOWN port 65535

To troubleshoot, check a few things:

  1. Check that your AWS credentials are configured for the right account:
aws sts get-caller-identity --query Account --output text
  2. Check that the region is correct:
aws configure get region
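
The two checks above can be wrapped in a small helper that compares what the CLI reports against what you expect. This is a minimal sketch: the function name `check_matches` and the account and region values in the usage comments are hypothetical, not part of the AWS CLI.

```shell
# Compare a value reported by the AWS CLI against the value you expect.
# Prints OK on a match; prints a message to stderr and returns 1 on a mismatch.
# check_matches is a hypothetical helper name used for illustration only.
check_matches() {
    local label="$1" actual="$2" expected="$3"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $label is $actual"
    else
        echo "MISMATCH: $label is '$actual', expected '$expected'" >&2
        return 1
    fi
}

# Usage (the account ID and region below are placeholders):
#   check_matches account "$(aws sts get-caller-identity --query Account --output text)" 123456789012
#   check_matches region  "$(aws configure get region)" us-west-2
```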

If those checks don't resolve it, try to SSM into a compute node instead. You'll need the cluster-id, worker-group name, and instance-id, which you can get from the aws sagemaker list-cluster-nodes --cluster-name <cluster-name> CLI call.

aws ssm start-session \
    --target sagemaker-cluster:<cluster-id>_worker-group-1-<instance-id>
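
Assembling the target string by hand is error-prone, so here is a small sketch that builds it from its parts. It assumes the cluster-id is the final path segment of the cluster ARN (the part after the last "/"); the function name `hyperpod_ssm_target` is hypothetical.

```shell
# Build an SSM target string for a HyperPod node.
# Args: cluster ARN, instance group name, instance ID.
# hyperpod_ssm_target is a hypothetical helper name used for illustration only.
hyperpod_ssm_target() {
    local cluster_arn="$1" group="$2" instance_id="$3"
    # Assumption: the cluster-id is the text after the last "/" in the ARN.
    local cluster_id="${cluster_arn##*/}"
    printf 'sagemaker-cluster:%s_%s-%s\n' "$cluster_id" "$group" "$instance_id"
}

# Usage (requires working AWS credentials; <...> values are placeholders):
#   arn=$(aws sagemaker describe-cluster --cluster-name <cluster-name> \
#           --query ClusterArn --output text)
#   aws ssm start-session \
#       --target "$(hyperpod_ssm_target "$arn" worker-group-1 <instance-id>)"
```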

Once you're there, you can get the IP address of the controller node by running:

sudo cat /opt/ml/config/resource_config.json | jq | grep -5 controller-machine

That'll show:

      "Name": "controller-machine",
      "InstanceType": "ml.m5.12xlarge",
      "Instances": [
        {
          "InstanceName": "controller-machine-1",
          "AgentIpAddress": "172.16.90.220",
          "CustomerIpAddress": "10.1.39.83",
          "InstanceId": "i-0defeb24a1f5dfe85"
        }
      ]

Use the CustomerIpAddress (10.1.39.83 in this example) to SSH into the headnode from that compute node:

ssh 10.1.39.83
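
The grep above prints surrounding lines that you then read by eye. If you would rather extract just the address, a jq filter along these lines may work. It is a sketch that assumes only the fields visible in the output above (Name, Instances, CustomerIpAddress) and searches the whole document rather than assuming a particular top-level key; the function name `controller_ip` is hypothetical.

```shell
# Read resource_config.json on stdin and print the CustomerIpAddress of every
# instance in the group named "controller-machine".
# controller_ip is a hypothetical helper name used for illustration only.
controller_ip() {
    # ".." walks every value in the document; "objects" keeps only JSON objects,
    # so the filter does not depend on where the instance groups are nested.
    jq -r '.. | objects
           | select(.Name? == "controller-machine")
           | .Instances[]?.CustomerIpAddress'
}

# Usage on the compute node:
#   sudo cat /opt/ml/config/resource_config.json | controller_ip
#   ssh "$(sudo cat /opt/ml/config/resource_config.json | controller_ip | head -n 1)"
```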

sean-smith avatar Apr 22 '24 16:04 sean-smith

Hi Sean,

Thank you for the detailed message on this. I am in a similar situation, but when I access my compute node, I get Permission denied (publickey). Can we replace the controller machine?

m-ali4721 avatar Jul 11 '24 16:07 m-ali4721

@m-ali4721 you can't replace the headnode, but you can add a login node that'll act as a jump box to the headnode. See the instructions on how to do that here: https://catalog.workshops.aws/sagemaker-hyperpod/en-US/05-advanced/07-login-node

Also, regarding the command to access the compute node:

aws ssm start-session \
    --target sagemaker-cluster:<cluster-id>_worker-group-1-<instance-id>

You shouldn't need an SSH key pair for that command, since it uses SSM in lieu of SSH. Are you getting the error at this step, or on the compute node when trying to connect to the headnode?

sean-smith avatar Jul 11 '24 17:07 sean-smith

Yes, I can successfully access the compute node, but when I run "ssh privateIP" against the controller machine from the compute node, I receive the error: Permission denied (publickey)

m-ali4721 avatar Jul 11 '24 19:07 m-ali4721

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Oct 10 '24 01:10 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Dec 09 '24 02:12 github-actions[bot]