dask-cloudprovider
Problem initializing GPU cluster on AWS ECS
Hello,
I am trying to start up the Dask cluster per this tutorial: https://medium.com/rapids-ai/getting-started-with-rapids-on-aws-ecs-using-dask-cloud-provider-b1adfdbc9c6e
I have an ECS cluster running.

I am getting an error when initializing the cluster:
cluster = ECSCluster(
    cluster_arn="arn:aws:ecs:us-east-1:*********:cluster/dask-cluster",
    n_workers=1,
    worker_gpu=1,
    fargate_scheduler=True
)

Any help is appreciated!
Please note that the current maintainer of this project is on leave, and it might take some time for you to get help.
I cannot tell anything from your exception - but it would be better to paste in the full text here (rather than screenshots), and also to investigate the container/host logs as provided by AWS. I only see "reason: ATTRIBUTE", which doesn't mean much to me.
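If the tasks got far enough to start, their output should land in the CloudWatch log group the cluster was configured with. A rough sketch with boto3 for pulling the latest events (the group name "dask-ecs" is a guess; substitute whatever your cluster uses):

import boto3

logs = boto3.client("logs", region_name="us-east-1")

# "dask-ecs" is a guess at the cloudwatch_logs_group the cluster used;
# adjust to whatever ECSCluster was configured with.
group = "dask-ecs"
streams = logs.describe_log_streams(
    logGroupName=group, orderBy="LastEventTime", descending=True, limit=5
)["logStreams"]
for stream in streams:
    events = logs.get_log_events(
        logGroupName=group, logStreamName=stream["logStreamName"], limit=20
    )["events"]
    for event in events:
        print(stream["logStreamName"], event["message"])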
Here is the full output of the error:
/opt/anaconda3/lib/python3.7/contextlib.py:119: UserWarning: Creating your cluster is taking a surprisingly long time. This is likely due to pending resources on AWS. Hang tight!
next(self.gen)
Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /opt/anaconda3/lib/python3.7/asyncio/tasks.py:623> exception=RuntimeError({'tasks': [], 'failures': [{'arn': 'arn:aws:ecs:us-east-1:445627148856:container-instance/78cf8f66-4cc9-4d22-ad8f-6cf66e54e847', 'reason': 'ATTRIBUTE'}], 'ResponseMetadata': {'RequestId': '2e8899db-2275-4361-b1a6-bcaa3bb94060', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '2e8899db-2275-4361-b1a6-bcaa3bb94060', 'content-type': 'application/x-amz-json-1.1', 'content-length': '147', 'date': 'Fri, 08 May 2020 16:23:18 GMT'}, 'RetryAttempts': 0}})>
Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/providers/aws/ecs.py", line 130, in _
await self.start()
File "/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/providers/aws/ecs.py", line 193, in start
while timeout.run():
File "/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/utils/timeout.py", line 74, in run
raise self.exception
File "/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/providers/aws/ecs.py", line 229, in start
raise RuntimeError(response) # print entire response
RuntimeError: {'tasks': [], 'failures': [{'arn': 'arn:aws:ecs:us-east-1:445627148856:container-instance/78cf8f66-4cc9-4d22-ad8f-6cf66e54e847', 'reason': 'ATTRIBUTE'}], 'ResponseMetadata': {'RequestId': '2e8899db-2275-4361-b1a6-bcaa3bb94060', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '2e8899db-2275-4361-b1a6-bcaa3bb94060', 'content-type': 'application/x-amz-json-1.1', 'content-length': '147', 'date': 'Fri, 08 May 2020 16:23:18 GMT'}, 'RetryAttempts': 0}}
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-3-108beeef8cf8> in <module>
3 n_workers=1,
4 worker_gpu=1,
----> 5 fargate_scheduler=True
6 )
/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/providers/aws/ecs.py in __init__(self, fargate_scheduler, fargate_workers, image, scheduler_cpu, scheduler_mem, scheduler_timeout, worker_cpu, worker_mem, worker_gpu, n_workers, cluster_arn, cluster_name_template, execution_role_arn, task_role_arn, task_role_policies, cloudwatch_logs_group, cloudwatch_logs_stream_prefix, cloudwatch_logs_default_retention, vpc, subnets, security_groups, environment, tags, find_address_timeout, skip_cleanup, aws_access_key_id, aws_secret_access_key, region_name, **kwargs)
593 self._region_name = region_name
594 self._lock = asyncio.Lock()
--> 595 super().__init__(**kwargs)
596
597 async def _start(self,):
/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/spec.py in __init__(self, workers, scheduler, worker, asynchronous, loop, security, silence_logs, name)
255 self._loop_runner.start()
256 self.sync(self._start)
--> 257 self.sync(self._correct_state)
258
259 async def _start(self):
/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
161 return future
162 else:
--> 163 return sync(self.loop, func, *args, **kwargs)
164
165 async def _get_logs(self, scheduler=True, workers=True):
/opt/anaconda3/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
345 if error[0]:
346 typ, exc, tb = error[0]
--> 347 raise exc.with_traceback(tb)
348 else:
349 return result[0]
/opt/anaconda3/lib/python3.7/site-packages/distributed/utils.py in f()
329 if callback_timeout is not None:
330 future = asyncio.wait_for(future, callback_timeout)
--> 331 result[0] = yield future
332 except Exception as exc:
333 error[0] = sys.exc_info()
/opt/anaconda3/lib/python3.7/site-packages/tornado/gen.py in run(self)
733
734 try:
--> 735 value = future.result()
736 except Exception:
737 exc_info = sys.exc_info()
/opt/anaconda3/lib/python3.7/site-packages/distributed/deploy/spec.py in _correct_state_internal(self)
333 for w in workers:
334 w._cluster = weakref.ref(self)
--> 335 await w # for tornado gen.coroutine support
336 self.workers.update(dict(zip(to_open, workers)))
337
/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/providers/aws/ecs.py in _()
128 async with self.lock:
129 if not self.task:
--> 130 await self.start()
131 assert self.task
132 return self
/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/providers/aws/ecs.py in start(self)
191 async def start(self):
192 timeout = Timeout(60, "Unable to start %s after 60 seconds" % self.task_type)
--> 193 while timeout.run():
194 try:
195 kwargs = (
/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/utils/timeout.py in run(self)
72 return False
73 else:
---> 74 raise self.exception
75 return True
76
/opt/anaconda3/lib/python3.7/site-packages/dask_cloudprovider/providers/aws/ecs.py in start(self)
227
228 if not response.get("tasks"):
--> 229 raise RuntimeError(response) # print entire response
230
231 [self.task] = response["tasks"]
RuntimeError: {'tasks': [], 'failures': [{'arn': 'arn:aws:ecs:us-east-1:445627148856:container-instance/78cf8f66-4cc9-4d22-ad8f-6cf66e54e847', 'reason': 'ATTRIBUTE'}], 'ResponseMetadata': {'RequestId': '5f665f57-00f2-4463-b191-470fe5e951f9', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '5f665f57-00f2-4463-b191-470fe5e951f9', 'content-type': 'application/x-amz-json-1.1', 'content-length': '147', 'date': 'Fri, 08 May 2020 16:24:19 GMT'}, 'RetryAttempts': 0}}
@nshaposh See the AWS API error message descriptions on this page: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/api_failures_messages.html
ATTRIBUTE (container instance ID): Your task definition contains a parameter that requires a specific container instance attribute that is not available on your container instances. For example, if your task uses the awsvpc network mode, but there are no instances in your specified subnets with the ecs.capability.task-eni attribute. For more information about which attributes are required for specific task definition parameters and agent configuration variables, see Task Definition Parameters and Amazon ECS Container Agent Configuration.
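A quick way to see which attributes a container instance has actually registered is to ask ECS directly. A boto3 sketch (the cluster name is a placeholder; use your own):

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

cluster = "dask-cluster"  # placeholder, use your own cluster name
arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
detail = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)

for ci in detail["containerInstances"]:
    print(ci["containerInstanceArn"])
    # Placement fails with reason ATTRIBUTE when an attribute the task
    # definition requires is missing from this list.
    for attr in ci["attributes"]:
        print("  ", attr["name"], attr.get("value", ""))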
Sorry for the delay here, I've been out for a while.
My guess from the logs you've shared is that the container instance does not have GPU support. Which instance class did you choose?
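One way to check from Python is to look at the GPUs the instance registered with ECS. A sketch with boto3 (again, the cluster name is a placeholder):

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

cluster = "dask-cluster"  # placeholder
arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
detail = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
for ci in detail["containerInstances"]:
    # On the ECS GPU-optimized AMI the GPUs show up under registeredResources;
    # if none are listed here, tasks requesting worker_gpu cannot be placed.
    gpus = [r for r in ci["registeredResources"] if r.get("name") == "GPU"]
    print(ci["ec2InstanceId"], "GPUs registered:", gpus or "none")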
Small nudge here @nshaposh. Have you been able to troubleshoot this further?
Hello,
Unfortunately, I haven't been able to fix this yet as I was busy with another project. This is also an expensive experiment in terms of cloud resources. I am planning to get back to this later this week, if that is OK.
Thank you!
Nikolai
Hey @jacobtomlinson, not sure what ever happened here, but I'm trying to set up an example cluster in AWS and happened across the same tutorial mentioned at the top of the post.
I created an ECS cluster from the AWS console, using all of the defaults (creating new roles, etc.). I have a single g4dn.4xlarge instance spun up for my ECS cluster. I also have admin AWS permissions on my local machine.
When running the following code:
from dask_cloudprovider.aws import ECSCluster
cluster = ECSCluster(cluster_arn="arn:aws:ecs:us-west-1:<account>:cluster/test-dask-ecs", n_workers=1, security_groups=['sg-05503c0960ad41359'], fargate_scheduler=True, worker_gpu=1, worker_mem=1000, worker_cpu=1000)
I hit a similar error to above:
/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/contextlib.py:126: UserWarning: Creating your cluster is taking a surprisingly long time. This is likely due to pending resources on AWS. Hang tight!
next(self.gen)
Task exception was never retrieved
future: <Task finished name='Task-8434' coro=<_wrap_awaitable() done, defined at /usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/asyncio/tasks.py:681> exception=RuntimeError({'tasks': [], 'failures': [{'arn': 'arn:aws:ecs:us-west-1:<account>:container-instance/4efb17d8911a432a80f646c3ea5bc94e', 'reason': 'ATTRIBUTE'}], 'ResponseMetadata': {'RequestId': '7acd9c24-f46a-40ef-b833-6eda4afd0c07', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '7acd9c24-f46a-40ef-b833-6eda4afd0c07', 'content-type': 'application/x-amz-json-1.1', 'content-length': '143', 'date': 'Fri, 07 Oct 2022 04:37:29 GMT'}, 'RetryAttempts': 0}})>
Traceback (most recent call last):
File "/usr/local/Cellar/[email protected]/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/asyncio/tasks.py", line 688, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/usr/local/lib/python3.9/site-packages/dask_cloudprovider/aws/ecs.py", line 171, in _
await self.start()
File "/usr/local/lib/python3.9/site-packages/dask_cloudprovider/aws/ecs.py", line 240, in start
while timeout.run():
File "/usr/local/lib/python3.9/site-packages/dask_cloudprovider/utils/timeout.py", line 74, in run
raise self.exception
File "/usr/local/lib/python3.9/site-packages/dask_cloudprovider/aws/ecs.py", line 294, in start
raise RuntimeError(response) # print entire response
RuntimeError: {'tasks': [], 'failures': [{'arn': 'arn:aws:ecs:us-west-1:<account>:container-instance/4efb17d8911a432a80f646c3ea5bc94e', 'reason': 'ATTRIBUTE'}], 'ResponseMetadata': {'RequestId': '7acd9c24-f46a-40ef-b833-6eda4afd0c07', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '7acd9c24-f46a-40ef-b833-6eda4afd0c07', 'content-type': 'application/x-amz-json-1.1', 'content-length': '143', 'date': 'Fri, 07 Oct 2022 04:37:29 GMT'}, 'RetryAttempts': 0}}
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
File /usr/local/lib/python3.9/site-packages/distributed/deploy/spec.py:277, in SpecCluster.__init__(self, workers, scheduler, worker, asynchronous, loop, security, silence_logs, name, shutdown_on_close, scheduler_sync_interval)
276 try:
--> 277 self.sync(self._correct_state)
278 except Exception:
File /usr/local/lib/python3.9/site-packages/distributed/utils.py:339, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
338 else:
--> 339 return sync(
340 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
341 )
File /usr/local/lib/python3.9/site-packages/distributed/utils.py:406, in sync(loop, func, callback_timeout, *args, **kwargs)
405 typ, exc, tb = error
--> 406 raise exc.with_traceback(tb)
407 else:
File /usr/local/lib/python3.9/site-packages/distributed/utils.py:379, in sync.<locals>.f()
378 future = asyncio.ensure_future(future)
--> 379 result = yield future
380 except Exception:
File /usr/local/lib/python3.9/site-packages/tornado/gen.py:762, in Runner.run(self)
761 try:
--> 762 value = future.result()
763 except Exception:
File /usr/local/lib/python3.9/site-packages/distributed/deploy/spec.py:369, in SpecCluster._correct_state_internal(self)
368 w._cluster = weakref.ref(self)
--> 369 await w # for tornado gen.coroutine support
370 self.workers.update(dict(zip(to_open, workers)))
File /usr/local/lib/python3.9/site-packages/dask_cloudprovider/aws/ecs.py:171, in Task.__await__.<locals>._()
170 if not self.task:
--> 171 await self.start()
172 assert self.task
File /usr/local/lib/python3.9/site-packages/dask_cloudprovider/aws/ecs.py:240, in Task.start(self)
239 timeout = Timeout(60, "Unable to start %s after 60 seconds" % self.task_type)
--> 240 while timeout.run():
241 try:
File /usr/local/lib/python3.9/site-packages/dask_cloudprovider/utils/timeout.py:74, in Timeout.run(self)
73 else:
---> 74 raise self.exception
75 return True
File /usr/local/lib/python3.9/site-packages/dask_cloudprovider/aws/ecs.py:294, in Task.start(self)
293 if not response.get("tasks"):
--> 294 raise RuntimeError(response) # print entire response
296 [self.task] = response["tasks"]
RuntimeError: {'tasks': [], 'failures': [{'arn': 'arn:aws:ecs:us-west-1:<account>:container-instance/4efb17d8911a432a80f646c3ea5bc94e', 'reason': 'ATTRIBUTE'}], 'ResponseMetadata': {'RequestId': '4583d053-e90f-4423-a7a2-37584e35c75e', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '4583d053-e90f-4423-a7a2-37584e35c75e', 'content-type': 'application/x-amz-json-1.1', 'content-length': '143', 'date': 'Fri, 07 Oct 2022 04:38:29 GMT'}, 'RetryAttempts': 0}}
During handling of the above exception, another exception occurred:
AssertionError Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 cluster = ECSCluster(cluster_arn="arn:aws:ecs:us-west-1:<account>:cluster/test-dask-ecs", n_workers=1, security_groups=['sg-05503c0960ad41359'], worker_gpu=1, worker_mem=1000, worker_cpu=1000, fargate_scheduler=True)
File /usr/local/lib/python3.9/site-packages/dask_cloudprovider/aws/ecs.py:788, in ECSCluster.__init__(self, fargate_scheduler, fargate_workers, fargate_spot, image, scheduler_cpu, scheduler_mem, scheduler_timeout, scheduler_extra_args, scheduler_task_kwargs, scheduler_address, worker_cpu, worker_nthreads, worker_mem, worker_gpu, worker_extra_args, worker_task_kwargs, n_workers, workers_name_start, workers_name_step, cluster_arn, cluster_name_template, execution_role_arn, task_role_arn, task_role_policies, cloudwatch_logs_group, cloudwatch_logs_stream_prefix, cloudwatch_logs_default_retention, vpc, subnets, security_groups, environment, tags, find_address_timeout, skip_cleanup, aws_access_key_id, aws_secret_access_key, region_name, platform_version, fargate_use_private_ip, mount_points, volumes, mount_volumes_on_scheduler, **kwargs)
786 self._lock = asyncio.Lock()
787 self.session = get_session()
--> 788 super().__init__(**kwargs)
File /usr/local/lib/python3.9/site-packages/distributed/deploy/spec.py:279, in SpecCluster.__init__(self, workers, scheduler, worker, asynchronous, loop, security, silence_logs, name, shutdown_on_close, scheduler_sync_interval)
277 self.sync(self._correct_state)
278 except Exception:
--> 279 self.sync(self.close)
280 raise
File /usr/local/lib/python3.9/site-packages/distributed/utils.py:339, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
337 return future
338 else:
--> 339 return sync(
340 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
341 )
File /usr/local/lib/python3.9/site-packages/distributed/utils.py:406, in sync(loop, func, callback_timeout, *args, **kwargs)
404 if error:
405 typ, exc, tb = error
--> 406 raise exc.with_traceback(tb)
407 else:
408 return result
File /usr/local/lib/python3.9/site-packages/distributed/utils.py:379, in sync.<locals>.f()
377 future = asyncio.wait_for(future, callback_timeout)
378 future = asyncio.ensure_future(future)
--> 379 result = yield future
380 except Exception:
381 error = sys.exc_info()
File /usr/local/lib/python3.9/site-packages/tornado/gen.py:762, in Runner.run(self)
759 exc_info = None
761 try:
--> 762 value = future.result()
763 except Exception:
764 exc_info = sys.exc_info()
File /usr/local/lib/python3.9/site-packages/distributed/deploy/spec.py:437, in SpecCluster._close(self)
435 await self.scheduler.close()
436 for w in self._created:
--> 437 assert w.status in {
438 Status.closing,
439 Status.closed,
440 Status.failed,
441 }, w.status
443 if hasattr(self, "_old_logging_level"):
444 silence_logging(self._old_logging_level)
AssertionError: Status.created
I have also compared the required attributes in the task definitions against those of my container instance, and none appear to be missing. The console shows the Fargate scheduler spin up, stay pending for a long time, run briefly, and then spin down. My single worker node on the EC2 instance never spins up. I haven't managed to spin up a single Dask scheduler (I tried setting fargate_scheduler=False) or worker on the EC2 instance yet.
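For reference, the comparison was along these lines, with boto3 (the task definition family name here is a guess; I read the real names off the task definitions the cluster registered in the ECS console):

import boto3

ecs = boto3.client("ecs", region_name="us-west-1")

# Family name is a guess; check the task definitions dask-cloudprovider
# registered in the ECS console for the real one.
td = ecs.describe_task_definition(taskDefinition="dask-worker")["taskDefinition"]
required = {a["name"] for a in td.get("requiresAttributes", [])}

cluster = "test-dask-ecs"
arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
cis = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
for ci in cis["containerInstances"]:
    have = {a["name"] for a in ci["attributes"]}
    print(ci["ec2InstanceId"], "missing:", required - have or "nothing")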
I do see logs from the Fargate scheduler task:
2022-10-06 21:38:32 distributed.scheduler - INFO - End scheduler at 'tcp://172.31.18.1:8786'
2022-10-06 21:38:31 distributed.scheduler - INFO - Scheduler closing all comms
2022-10-06 21:38:31 distributed.scheduler - INFO - Scheduler closing...
2022-10-06 21:36:27 distributed.scheduler - INFO - dashboard at: :8787
2022-10-06 21:36:27 distributed.scheduler - INFO - Scheduler at: tcp://172.31.18.1:8786
2022-10-06 21:36:27 distributed.scheduler - INFO - Clear task state
2022-10-06 21:36:27 distributed.scheduler - INFO - -----------------------------------------------
2022-10-06 21:36:24 distributed.scheduler - INFO - -----------------------------------------------
2022-10-06 21:36:23 A JupyterLab server has been started!
2022-10-06 21:36:23 To access it, visit http://localhost:8888 on your host machine.
2022-10-06 21:36:23 Ensure the following arguments were added to "docker run" to expose the JupyterLab server to your host machine:
2022-10-06 21:36:23 -p 8888:8888 -p 8787:8787 -p 8786:8786
2022-10-06 21:36:23 Make local folders visible by bind mounting to /rapids/notebooks/host
2022-10-06 21:36:23 This container image and its contents are governed by the NVIDIA Deep Learning Container License.
2022-10-06 21:36:23 By pulling and using the container, you accept the terms and conditions of this license:
2022-10-06 21:36:23 https://developer.download.nvidia.com/licenses/NVIDIA_Deep_Learning_Container_License.pdf
I'm at a loss. Any idea what's going on? I'll continue searching, but both the AWS error message and the Dask error message make this a challenge to debug.
Yeah looks like this issue lost steam with the OP not having time to debug further. Thanks for following up with all the info I asked for.
The startup code for the worker task creates the task and then waits 60 seconds for it to run before giving up. Looking at the error response we can see that the task is never created, and instead we get a failure with reason ATTRIBUTE, which is not especially helpful, so I'm not sure where to start with this one.
{
"tasks":[],
"failures":[
{
"arn":"arn:aws:ecs:us-west-1:<account>:container-instance/4efb17d8911a432a80f646c3ea5bc94e",
"reason":"ATTRIBUTE"
}
],
"ResponseMetadata":{
"RequestId":"4583d053-e90f-4423-a7a2-37584e35c75e",
"HTTPStatusCode":200,
"HTTPHeaders":{
"x-amzn-requestid":"4583d053-e90f-4423-a7a2-37584e35c75e",
"content-type":"application/x-amz-json-1.1",
"content-length":"143",
"date":"Fri, 07 Oct 2022 04:38:29 GMT"
},
"RetryAttempts":0
}
}
Is there anything else in the AWS dashboard related to the ARN arn:aws:ecs:us-west-1:<account>:container-instance/4efb17d8911a432a80f646c3ea5bc94e that provides more information on what went wrong?
Sorry, never got an email that you had responded! This error ended up being due to the EC2 instance being in a different security group and subnet than the cluster (which was surprising, considering I used AWS defaults to set everything up). So no EC2 instances existed with my target security groups, and therefore no EC2 instances matched my requested attributes. Seems like AWS could provide better error messaging!
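For anyone who hits this later, a cross-check like the following boto3 sketch (the cluster name is a placeholder) would have surfaced the mismatch, by comparing the subnet and security groups the registered instances actually live in against the ones passed to ECSCluster:

import boto3

ecs = boto3.client("ecs", region_name="us-west-1")
ec2 = boto3.client("ec2", region_name="us-west-1")

cluster = "test-dask-ecs"  # placeholder
arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
cis = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)

# Map the ECS container instances back to their EC2 instances and print
# the subnet and security groups each one actually uses.
ids = [ci["ec2InstanceId"] for ci in cis["containerInstances"]]
resp = ec2.describe_instances(InstanceIds=ids)
for reservation in resp["Reservations"]:
    for inst in reservation["Instances"]:
        print(inst["InstanceId"], inst["SubnetId"],
              [sg["GroupId"] for sg in inst["SecurityGroups"]])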
Regardless, I just successfully spun up a GPU worker on an EC2 instance, using a Fargate scheduler! Still working out some of the kinks, but it certainly seems like I'm on the right track now.
Ok great! I'll close this out then but if you need anything else don't hesitate to comment.
Do you think there are docs improvements we could make to avoid others running into this?
I actually think the improvement should be on AWS's side: if a user hits a "no instance has the required attributes" error, AWS should probably list the instances that were considered. It would have made this much more apparent if they had shown the subnet being considered, along with the fact that zero instances were available in that subnet, rather than just a vague 'failures': [{'arn': 'arn:aws:ecs:us-west-1:<account>:container-instance/4efb17d8911a432a80f646c3ea5bc94e', 'reason': 'ATTRIBUTE'}].
Maybe raise this via AWS support? They are pretty good at listening to user feedback and passing it along to their teams.