dask-cloudprovider
Problem initializing GPU cluster on AWS ECS
Hello,
I am trying to start up the Dask cluster per this tutorial: https://medium.com/rapids-ai/getting-started-with-rapids-on-aws-ecs-using-dask-cloud-provider-b1adfdbc9c6e
I have an ECS cluster running.

I am getting an error when initializing the cluster:
cluster = ECSCluster(
    cluster_arn="arn:aws:ecs:us-east-1:*********:cluster/dask-cluster",
    n_workers=1,
    worker_gpu=1,
    fargate_scheduler=True
)

Any help is appreciated!
Please note that the current maintainer of this project is on leave, and it might take some time for you to get help.
I cannot tell anything from your exception - but it would be better to paste in the full text here (rather than screenshots), and also to investigate the container/host logs as provided by AWS. I only see "reason: ATTRIBUTE", which doesn't mean much to me.
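If the tasks got far enough to start, their output should land in the CloudWatch log group the cluster was configured with. A rough sketch with boto3 for pulling the latest events (the group name "dask-ecs" is a guess; substitute whatever your cluster uses):

import boto3

logs = boto3.client("logs", region_name="us-east-1")

# "dask-ecs" is a guess at the cloudwatch_logs_group the cluster used;
# adjust to whatever ECSCluster was configured with.
group = "dask-ecs"
streams = logs.describe_log_streams(
    logGroupName=group, orderBy="LastEventTime", descending=True, limit=5
)["logStreams"]
for stream in streams:
    events = logs.get_log_events(
        logGroupName=group, logStreamName=stream["logStreamName"], limit=20
    )["events"]
    for event in events:
        print(stream["logStreamName"], event["message"])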
Here is the full output of the error:
/opt/anaconda3/lib/python3.7/contextlib.py:119: UserWarning: Creating your cluster is taking a surprisingly long time. This is likely due to pending resources on AWS. Hang tight!
next(self.gen)
Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/anaconda3/lib/python3.7/asyncio/tasks.py:623> exception=RuntimeError({'tasks': [], 'failures': [{'arn': 'arn:aws:ecs:us-east-1:445627148856:container-instance/78cf8f66-4cc9-4d22-ad8f-6cf66e54e847', 'reason': 'ATTRIBUTE'}], 'ResponseMetadata': {'RequestId': '2e8899db-2275-4361-b1a6-bcaa3bb94060', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '2e8899db-2275-4361-b1a6-bcaa3bb94060', 'content-type': 'application/x-amz-json-1.1', 'content-length': '147', 'date': 'Fri, 08 May 2020 16:23:18 GMT'}, 'RetryAttempts': 0}})>
Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/providers/aws/ecs.py", line 130, in _
await self.start()
File "/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/providers/aws/ecs.py", line 193, in start
while timeout.run():
File "/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/utils/timeout.py", line 74, in run
raise self.exception
File "/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/providers/aws/ecs.py", line 229, in start
raise RuntimeError(response) # print entire response
RuntimeError: {'tasks': [], 'failures': [{'arn': 'arn:aws:ecs:us-east-1:445627148856:container-instance/78cf8f66-4cc9-4d22-ad8f-6cf66e54e847', 'reason': 'ATTRIBUTE'}], 'ResponseMetadata': {'RequestId': '2e8899db-2275-4361-b1a6-bcaa3bb94060', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '2e8899db-2275-4361-b1a6-bcaa3bb94060', 'content-type': 'application/x-amz-json-1.1', 'content-length': '147', 'date': 'Fri, 08 May 2020 16:23:18 GMT'}, 'RetryAttempts': 0}}
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-3-108beeef8cf8> in <module>
3 n_workers=1,
4 worker_gpu=1,
----> 5 fargate_scheduler=True
6 )
/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/providers/aws/ecs.py in __init__(self, fargate_scheduler, fargate_workers, image, scheduler_cpu, scheduler_mem, scheduler_timeout, worker_cpu, worker_mem, worker_gpu, n_workers, cluster_arn, cluster_name_template, execution_role_arn, task_role_arn, task_role_policies, cloudwatch_logs_group, cloudwatch_logs_stream_prefix, cloudwatch_logs_default_retention, vpc, subnets, security_groups, environment, tags, find_address_timeout, skip_cleanup, aws_access_key_id, aws_secret_access_key, region_name, **kwargs)
593 self._region_name = region_name
594 self._lock = asyncio.Lock()
--> 595 super().__init__(**kwargs)
596
597 async def _start(self,):
/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/spec.py in __init__(self, workers, scheduler, worker, asynchronous, loop, security, silence_logs, name)
255 self._loop_runner.start()
256 self.sync(self._start)
--> 257 self.sync(self._correct_state)
258
259 async def _start(self):
/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
161 return future
162 else:
--> 163 return sync(self.loop, func, *args, **kwargs)
164
165 async def _get_logs(self, scheduler=True, workers=True):
/opt/anaconda3/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
345 if error[0]:
346 typ, exc, tb = error[0]
--> 347 raise exc.with_traceback(tb)
348 else:
349 return result[0]
/opt/anaconda3/lib/python3.7/site-packages/distributed/utils.py in f()
329 if callback_timeout is not None:
330 future = asyncio.wait_for(future, callback_timeout)
--> 331 result[0] = yield future
332 except Exception as exc:
333 error[0] = sys.exc_info()
/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py in run(self)
733
734 try:
--> 735 value = future.result()
736 except Exception:
737 exc_info = sys.exc_info()
/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/spec.py in _correct_state_internal(self)
333 for w in workers:
334 w._cluster = weakref.ref(self)
--> 335 await w # for tornado gen.coroutine support
336 self.workers.update(dict(zip(to_open, workers)))
337
/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/providers/aws/ecs.py in _()
128 async with self.lock:
129 if not self.task:
--> 130 await self.start()
131 assert self.task
132 return self
/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/providers/aws/ecs.py in start(self)
191 async def start(self):
192 timeout = Timeout(60, "Unable to start %s after 60 seconds" % self.task_type)
--> 193 while timeout.run():
194 try:
195 kwargs = (
/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/utils/timeout.py in run(self)
72 return False
73 else:
---> 74 raise self.exception
75 return True
76
/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/providers/aws/ecs.py in start(self)
227
228 if not response.get("tasks"):
--> 229 raise RuntimeError(response) # print entire response
230
231 [self.task] = response["tasks"]
RuntimeError: {'tasks': [], 'failures': [{'arn': 'arn:aws:ecs:us-east-1:445627148856:container-instance/78cf8f66-4cc9-4d22-ad8f-6cf66e54e847', 'reason': 'ATTRIBUTE'}], 'ResponseMetadata': {'RequestId': '5f665f57-00f2-4463-b191-470fe5e951f9', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '5f665f57-00f2-4463-b191-470fe5e951f9', 'content-type': 'application/x-amz-json-1.1', 'content-length': '147', 'date': 'Fri, 08 May 2020 16:24:19 GMT'}, 'RetryAttempts': 0}}
@nshaposh See the AWS API error message descriptions on this page: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/api_failures_messages.html
ATTRIBUTE (container instance ID): Your task definition contains a parameter that requires a specific container instance attribute that is not available on your container instances. For example, if your task uses the awsvpc network mode, but there are no instances in your specified subnets with the ecs.capability.task-eni attribute. For more information about which attributes are required for specific task definition parameters and agent configuration variables, see Task Definition Parameters and Amazon ECS Container Agent Configuration.
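A quick way to see which attributes a container instance has actually registered is to ask ECS directly. A boto3 sketch (the cluster name is a placeholder; use your own):

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

cluster = "dask-cluster"  # placeholder, use your own cluster name
arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
detail = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)

for ci in detail["containerInstances"]:
    print(ci["containerInstanceArn"])
    # Placement fails with reason ATTRIBUTE when an attribute the task
    # definition requires is missing from this list.
    for attr in ci["attributes"]:
        print("  ", attr["name"], attr.get("value", ""))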
Sorry for the delay here, I've been out for a while.
My guess from the logs you've shared is that the container instance does not have GPU support. Which instance class did you choose?
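One way to check from Python is to look at the GPUs the instance registered with ECS. A sketch with boto3 (again, the cluster name is a placeholder):

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

cluster = "dask-cluster"  # placeholder
arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
detail = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
for ci in detail["containerInstances"]:
    # On the ECS GPU-optimized AMI the GPUs show up under registeredResources;
    # if none are listed here, tasks requesting worker_gpu cannot be placed.
    gpus = [r for r in ci["registeredResources"] if r.get("name") == "GPU"]
    print(ci["ec2InstanceId"], "GPUs registered:", gpus or "none")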
Small nudge here @nshaposh. Have you been able to troubleshoot this further?
Hello,
Unfortunately, I haven't been able to fix this yet as I was busy with another project. This is also an expensive experiment in terms of cloud resources. I am planning to get back to this later this week, if that is OK.
Thank you!
Nikolai
Hey @jacobtomlinson, not sure what ever happened here, but I'm trying to set up an example cluster in AWS and happened across the same tutorial mentioned at the top of the post.
I created an ECS cluster from the AWS console, using all of the defaults (creating new roles, etc.). I have a single g4dn.4xlarge instance spun up for my ECS cluster. I also have admin AWS permissions on my local machine.
When running the following code:
from dask_cloudprovider.aws import ECSCluster
cluster = ECSCluster(cluster_arn="arn:aws:ecs:us-west-1:<account>:cluster/test-dask-ecs", n_workers=1, security_groups=['sg-05503c0960ad41359'], fargate_scheduler=True, worker_gpu=1, worker_mem=1000, worker_cpu=1000)
I hit a similar error to above:
/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/contextlib.py:126: UserWarning: Creating your cluster is taking a surprisingly long time. This is likely due to pending resources on AWS. Hang tight!
next(self.gen)
Task exception was never retrieved
future: <Task finished name='Task-8434' coro=<_wrap_awaitable() done, defined at /usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/asyncio/tasks.py:681> exception=RuntimeError({'tasks': [], 'failures': [{'arn': 'arn:aws:ecs:us-west-1:<account>:container-instance/4efb17d8911a432a80f646c3ea5bc94e', 'reason': 'ATTRIBUTE'}], 'ResponseMetadata': {'RequestId': '7acd9c24-f46a-40ef-b833-6eda4afd0c07', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '7acd9c24-f46a-40ef-b833-6eda4afd0c07', 'content-type': 'application/x-amz-json-1.1', 'content-length': '143', 'date': 'Fri, 07 Oct 2022 04:37:29 GMT'}, 'RetryAttempts': 0}})>
Traceback (most recent call last):
File "/usr/local/Cellar/[email protected]/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/asyncio/tasks.py", line 688, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/usr/local/lib/python3.9/site-packages/dask_cloudprovider/aws/ecs.py", line 171, in _
await self.start()
File "/usr/local/lib/python3.9/site-packages/dask_cloudprovider/aws/ecs.py", line 240, in start
while timeout.run():
File "/usr/local/lib/python3.9/site-packages/dask_cloudprovider/utils/timeout.py", line 74, in run
raise self.exception
File "/usr/local/lib/python3.9/site-packages/dask_cloudprovider/aws/ecs.py", line 294, in start
raise RuntimeError(response) # print entire response
RuntimeError: {'tasks': [], 'failures': [{'arn': 'arn:aws:ecs:us-west-1:<account>:container-instance/4efb17d8911a432a80f646c3ea5bc94e', 'reason': 'ATTRIBUTE'}], 'ResponseMetadata': {'RequestId': '7acd9c24-f46a-40ef-b833-6eda4afd0c07', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '7acd9c24-f46a-40ef-b833-6eda4afd0c07', 'content-type': 'application/x-amz-json-1.1', 'content-length': '143', 'date': 'Fri, 07 Oct 2022 04:37:29 GMT'}, 'RetryAttempts': 0}}
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
File /usr/local/lib/python3.9/site-packages/distributed/deploy/spec.py:277, in SpecCluster.__init__(self, workers, scheduler, worker, asynchronous, loop, security, silence_logs, name, shutdown_on_close, scheduler_sync_interval)
276 try:
--> 277 self.sync(self._correct_state)
278 except Exception:
File /usr/local/lib/python3.9/site-packages/distributed/utils.py:339, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
338 else:
--> 339 return sync(
340 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
341 )
File /usr/local/lib/python3.9/site-packages/distributed/utils.py:406, in sync(loop, func, callback_timeout, *args, **kwargs)
405 typ, exc, tb = error
--> 406 raise exc.with_traceback(tb)
407 else:
File /usr/local/lib/python3.9/site-packages/distributed/utils.py:379, in sync.<locals>.f()
378 future = asyncio.ensure_future(future)
--> 379 result = yield future
380 except Exception:
File /usr/local/lib/python3.9/site-packages/tornado/gen.py:762, in Runner.run(self)
761 try:
--> 762 value = future.result()
763 except Exception:
File /usr/local/lib/python3.9/site-packages/distributed/deploy/spec.py:369, in SpecCluster._correct_state_internal(self)
368 w._cluster = weakref.ref(self)
--> 369 await w # for tornado gen.coroutine support
370 self.workers.update(dict(zip(to_open, workers)))
File /usr/local/lib/python3.9/site-packages/dask_cloudprovider/aws/ecs.py:171, in Task.__await__.<locals>._()
170 if not self.task:
--> 171 await self.start()
172 assert self.task
File /usr/local/lib/python3.9/site-packages/dask_cloudprovider/aws/ecs.py:240, in Task.start(self)
239 timeout = Timeout(60, "Unable to start %s after 60 seconds" % self.task_type)
--> 240 while timeout.run():
241 try:
File /usr/local/lib/python3.9/site-packages/dask_cloudprovider/utils/timeout.py:74, in Timeout.run(self)
73 else:
---> 74 raise self.exception
75 return True
File /usr/local/lib/python3.9/site-packages/dask_cloudprovider/aws/ecs.py:294, in Task.start(self)
293 if not response.get("tasks"):
--> 294 raise RuntimeError(response) # print entire response
296 [self.task] = response["tasks"]
RuntimeError: {'tasks': [], 'failures': [{'arn': 'arn:aws:ecs:us-west-1:<account>:container-instance/4efb17d8911a432a80f646c3ea5bc94e', 'reason': 'ATTRIBUTE'}], 'ResponseMetadata': {'RequestId': '4583d053-e90f-4423-a7a2-37584e35c75e', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '4583d053-e90f-4423-a7a2-37584e35c75e', 'content-type': 'application/x-amz-json-1.1', 'content-length': '143', 'date': 'Fri, 07 Oct 2022 04:38:29 GMT'}, 'RetryAttempts': 0}}
During handling of the above exception, another exception occurred:
AssertionError Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 cluster = ECSCluster(cluster_arn="arn:aws:ecs:us-west-1:<account>:cluster/test-dask-ecs", n_workers=1, security_groups=['sg-05503c0960ad41359'], worker_gpu=1, worker_mem=1000, worker_cpu=1000, fargate_scheduler=True)
File /usr/local/lib/python3.9/site-packages/dask_cloudprovider/aws/ecs.py:788, in ECSCluster.__init__(self, fargate_scheduler, fargate_workers, fargate_spot, image, scheduler_cpu, scheduler_mem, scheduler_timeout, scheduler_extra_args, scheduler_task_kwargs, scheduler_address, worker_cpu, worker_nthreads, worker_mem, worker_gpu, worker_extra_args, worker_task_kwargs, n_workers, workers_name_start, workers_name_step, cluster_arn, cluster_name_template, execution_role_arn, task_role_arn, task_role_policies, cloudwatch_logs_group, cloudwatch_logs_stream_prefix, cloudwatch_logs_default_retention, vpc, subnets, security_groups, environment, tags, find_address_timeout, skip_cleanup, aws_access_key_id, aws_secret_access_key, region_name, platform_version, fargate_use_private_ip, mount_points, volumes, mount_volumes_on_scheduler, **kwargs)
786 self._lock = asyncio.Lock()
787 self.session = get_session()
--> 788 super().__init__(**kwargs)
File /usr/local/lib/python3.9/site-packages/distributed/deploy/spec.py:279, in SpecCluster.__init__(self, workers, scheduler, worker, asynchronous, loop, security, silence_logs, name, shutdown_on_close, scheduler_sync_interval)
277 self.sync(self._correct_state)
278 except Exception:
--> 279 self.sync(self.close)
280 raise
File /usr/local/lib/python3.9/site-packages/distributed/utils.py:339, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
337 return future
338 else:
--> 339 return sync(
340 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
341 )
File /usr/local/lib/python3.9/site-packages/distributed/utils.py:406, in sync(loop, func, callback_timeout, *args, **kwargs)
404 if error:
405 typ, exc, tb = error
--> 406 raise exc.with_traceback(tb)
407 else:
408 return result
File /usr/local/lib/python3.9/site-packages/distributed/utils.py:379, in sync.<locals>.f()
377 future = asyncio.wait_for(future, callback_timeout)
378 future = asyncio.ensure_future(future)
--> 379 result = yield future
380 except Exception:
381 error = sys.exc_info()
File /usr/local/lib/python3.9/site-packages/tornado/gen.py:762, in Runner.run(self)
759 exc_info = None
761 try:
--> 762 value = future.result()
763 except Exception:
764 exc_info = sys.exc_info()
File /usr/local/lib/python3.9/site-packages/distributed/deploy/spec.py:437, in SpecCluster._close(self)
435 await self.scheduler.close()
436 for w in self._created:
--> 437 assert w.status in {
438 Status.closing,
439 Status.closed,
440 Status.failed,
441 }, w.status
443 if hasattr(self, "_old_logging_level"):
444 silence_logging(self._old_logging_level)
AssertionError: Status.created
I have also compared the required attributes in the task definitions against those of my container instance, and none appear to be missing. The console shows the Fargate scheduler spin up, stay pending for a long time, run briefly, and then spin down. My single worker node on the EC2 instance never spins up. I haven't managed to spin up a single Dask scheduler (I tried setting fargate_scheduler=False) or worker on the EC2 instance yet.
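For reference, the comparison was along these lines, with boto3 (the task definition family name here is a guess; I read the real names off the task definitions the cluster registered in the ECS console):

import boto3

ecs = boto3.client("ecs", region_name="us-west-1")

# Family name is a guess; check the task definitions dask-cloudprovider
# registered in the ECS console for the real one.
td = ecs.describe_task_definition(taskDefinition="dask-worker")["taskDefinition"]
required = {a["name"] for a in td.get("requiresAttributes", [])}

cluster = "test-dask-ecs"
arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
cis = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
for ci in cis["containerInstances"]:
    have = {a["name"] for a in ci["attributes"]}
    print(ci["ec2InstanceId"], "missing:", required - have or "nothing")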
I do see logs from the Fargate scheduler task:
2022-10-06 21:38:32 distributed.scheduler - INFO - End scheduler at 'tcp://172.31.18.1:8786'
2022-10-06 21:38:31 distributed.scheduler - INFO - Scheduler closing all comms
2022-10-06 21:38:31 distributed.scheduler - INFO - Scheduler closing...
2022-10-06 21:36:27 distributed.scheduler - INFO - dashboard at: :8787
2022-10-06 21:36:27 distributed.scheduler - INFO - Scheduler at: tcp://172.31.18.1:8786
2022-10-06 21:36:27 distributed.scheduler - INFO - Clear task state
2022-10-06 21:36:27 distributed.scheduler - INFO - -----------------------------------------------
2022-10-06 21:36:24 distributed.scheduler - INFO - -----------------------------------------------
2022-10-06 21:36:23 A JupyterLab server has been started!
2022-10-06 21:36:23 To access it, visit http://localhost:8888 on your host machine.
2022-10-06 21:36:23 Ensure the following arguments were added to "docker run" to expose the JupyterLab server to your host machine:
2022-10-06 21:36:23 -p 8888:8888 -p 8787:8787 -p 8786:8786
2022-10-06 21:36:23 Make local folders visible by bind mounting to /rapids/notebooks/host
2022-10-06 21:36:23 This container image and its contents are governed by the NVIDIA Deep Learning Container License.
2022-10-06 21:36:23 By pulling and using the container, you accept the terms and conditions of this license:
2022-10-06 21:36:23 https://developer.download.nvidia.com/licenses/NVIDIA_Deep_Learning_Container_License.pdf
I'm at a loss. Any idea what's going on? I'll continue searching, but both the AWS error message and the Dask error message make this a challenge to debug.
Yeah looks like this issue lost steam with the OP not having time to debug further. Thanks for following up with all the info I asked for.
The startup code for the worker task creates the task and then waits 60 seconds for it to run before giving up. Looking at the error response we can see that the task is never created, and instead we get a failure with reason ATTRIBUTE, which is not especially helpful, so I'm not sure where to start with this one.
{
"tasks":[],
"failures":[
{
"arn":"arn:aws:ecs:us-west-1:<account>:container-instance/4efb17d8911a432a80f646c3ea5bc94e",
"reason":"ATTRIBUTE"
}
],
"ResponseMetadata":{
"RequestId":"4583d053-e90f-4423-a7a2-37584e35c75e",
"HTTPStatusCode":200,
"HTTPHeaders":{
"x-amzn-requestid":"4583d053-e90f-4423-a7a2-37584e35c75e",
"content-type":"application/x-amz-json-1.1",
"content-length":"143",
"date":"Fri, 07 Oct 2022 04:38:29 GMT"
},
"RetryAttempts":0
}
}
Is there anything else in the AWS dashboard related to the ARN arn:aws:ecs:us-west-1:<account>:container-instance/4efb17d8911a432a80f646c3ea5bc94e that provides more information on what went wrong?
Sorry, never got an email that you had responded! This error ended up being due to the EC2 instance being in a different security group and subnet than the cluster (which was surprising, considering I used AWS defaults to set everything up). So no EC2 instances existed with my target security groups, and therefore no EC2 instances matched my requested attributes. Seems like AWS could provide better error messaging!
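For anyone who hits this later, a cross-check like the following boto3 sketch (the cluster name is a placeholder) would have surfaced the mismatch, by comparing the subnet and security groups the registered instances actually live in against the ones passed to ECSCluster:

import boto3

ecs = boto3.client("ecs", region_name="us-west-1")
ec2 = boto3.client("ec2", region_name="us-west-1")

cluster = "test-dask-ecs"  # placeholder
arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
cis = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)

# Map the ECS container instances back to their EC2 instances and print
# the subnet and security groups each one actually uses.
ids = [ci["ec2InstanceId"] for ci in cis["containerInstances"]]
resp = ec2.describe_instances(InstanceIds=ids)
for reservation in resp["Reservations"]:
    for inst in reservation["Instances"]:
        print(inst["InstanceId"], inst["SubnetId"],
              [sg["GroupId"] for sg in inst["SecurityGroups"]])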
Regardless, I just successfully spun up a GPU worker on an EC2 instance, using a Fargate scheduler! Still working out some of the kinks, but it certainly seems like I'm on the right track now.
Ok great! I'll close this out then but if you need anything else don't hesitate to comment.
Do you think there are docs improvements we could make to avoid others running into this?
I actually think the improvement should be on AWS's side: if a user hits a "no instance has the required attributes" error, AWS should probably list the instances that were considered. It would have made this much more apparent if they had shown the subnet being considered, along with the fact that zero instances were available in that subnet, rather than just a vague 'failures': [{'arn': 'arn:aws:ecs:us-west-1:<account>:container-instance/4efb17d8911a432a80f646c3ea5bc94e', 'reason': 'ATTRIBUTE'}].
Maybe raise this via AWS support? They are pretty good at listening to user feedback and passing it along to their teams.