[BUG]: ResourceLimitExceeded when running a pipeline on AWS SageMaker
Contact Details [Optional]
No response
System Information
ZenML version: 0.11.0
Install path: /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/zenml
Python version: 3.8.13
Platform information: {'os': 'linux', 'linux_distro': 'ubuntu', 'linux_distro_like': 'debian', 'linux_distro_version': '20.04'}
Environment: native
Integrations: ['aws', 'lightgbm', 'mlflow', 'plotly', 'pytorch', 's3', 'scipy', 'sklearn', 'wandb', 'xgboost']
What happened?
I tried running a pipeline on the AWS SageMaker stack and got the following error:

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p2.xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.

My region is Asia Pacific (Singapore), ap-southeast-1.
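For reference, this is roughly how the current limit could be inspected and an increase requested programmatically through the AWS Service Quotas API, instead of going through support manually. This is a minimal sketch, not something ZenML does itself; it assumes the quota name matches the one in the error message and that the credentials have `servicequotas` permissions:

```python
import boto3

REGION = "ap-southeast-1"
QUOTA_NAME = "ml.p2.xlarge for training job usage"  # name taken from the error above

client = boto3.client("service-quotas", region_name=REGION)

# Walk the paginated list of SageMaker quotas to find the one from the error.
quota, token = None, None
while quota is None:
    kwargs = {"ServiceCode": "sagemaker"}
    if token:
        kwargs["NextToken"] = token
    page = client.list_service_quotas(**kwargs)
    quota = next((q for q in page["Quotas"] if q["QuotaName"] == QUOTA_NAME), None)
    token = page.get("NextToken")
    if quota is None and not token:
        raise RuntimeError(f"Quota '{QUOTA_NAME}' not found in {REGION}")

print(f"Current limit: {quota['Value']} instances")

# Ask AWS to raise the limit to 1 instance; the request may still be
# reviewed by AWS support before it takes effect.
if quota["Value"] < 1:
    client.request_service_quota_increase(
        ServiceCode="sagemaker",
        QuotaCode=quota["QuotaCode"],
        DesiredValue=1.0,
    )
```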
Reproduction steps
No response
Relevant log output
(zenfiles) dnth@dnth:~/Desktop/zenfiles/image-segmentation$ python run_image_seg_pipeline.py
Creating run for pipeline: image_segmentation_pipeline
Cache enabled for pipeline image_segmentation_pipeline
/home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/xgboost/compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
from pandas import MultiIndex, Int64Index
Using stack sagemaker_stack_with_wandb to run pipeline image_segmentation_pipeline...
Step apply_augmentations has started.
Using cached version of apply_augmentations.
Step apply_augmentations has finished in 0.025s.
Step initiate_model_and_optimizer has started.
Using cached version of initiate_model_and_optimizer.
Step initiate_model_and_optimizer has finished in 0.021s.
Step prepare_df has started.
Using cached version of prepare_df.
Step prepare_df has finished in 0.022s.
Step create_stratified_fold has started.
Using cached version of create_stratified_fold.
Step create_stratified_fold has finished in 0.035s.
Step prepare_dataloaders has started.
Using cached version of prepare_dataloaders.
Step prepare_dataloaders has finished in 0.047s.
Step train_model has started.
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
Using step operator sagemaker to run step train_model.
Using dockerignore found at path '/home/dnth/Desktop/zenfiles/image-segmentation/.dockerignore' to create docker build context.
Building docker image '715803424590.dkr.ecr.ap-southeast-1.amazonaws.com/zenml-sagemaker:image_segmentation_pipeline', this might take a while...
Step 1/7 : FROM zenmldocker/zenml:0.11.0-py3.8
---> 79d1edfc393e
Step 2/7 : WORKDIR /app
---> Using cache
---> 6be63bda03c2
Step 3/7 : RUN pip install --no-cache 'Pillow>=9.1.0' 'boto3==1.21.21' 's3fs==2022.3.0' 'sagemaker==2.82.2' 'wandb>=0.12.12' 'zenml==0.11.0'
---> Using cache
---> fd4bb1181c1a
Step 4/7 : COPY . .
---> 3e8d71133573
Step 5/7 : RUN chmod -R a+rw .
---> Running in d275c22215d0
---> 409a3f32b150
Step 6/7 : ENV ZENML_CONFIG_PATH=/app/.zenconfig
---> Running in 59331c5cfa79
---> 6c521aa88414
Step 7/7 : ENTRYPOINT python -m zenml.step_operators.entrypoint --main_module run_image_seg_pipeline --step_source_path steps.model_steps.train_model --execution_info_path s3://zenfile-bucket/train_model/.system/executor_execution/107/.temp/zenml_execution_info.pb --input_artifact_types_path s3://zenfile-bucket/train_model/.system/executor_execution/107/.temp/input_artifacts.json
---> Running in 9d6fbf68eddb
---> b417c7fbbe9e
Successfully built b417c7fbbe9e
Successfully tagged 715803424590.dkr.ecr.ap-southeast-1.amazonaws.com/zenml-sagemaker:image_segmentation_pipeline
Finished building docker image.
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
Pushing docker image '715803424590.dkr.ecr.ap-southeast-1.amazonaws.com/zenml-sagemaker:image_segmentation_pipeline'.
Finished pushing docker image.
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: image-segmentation-pipeline-15-Aug-22-21-05-39-318118
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/dnth/Desktop/zenfiles/image-segmentation/run_image_seg_pipeline.py:25 in <module> │
│ │
│ 22 │
│ 23 │
│ 24 if __name__ == "__main__": │
│ ❱ 25 │ run_img_seg_pipe() │
│ 26 │
│ │
│ /home/dnth/Desktop/zenfiles/image-segmentation/run_image_seg_pipeline.py:21 in run_img_seg_pipe │
│ │
│ 18 │ │ initiate_model_and_optimizer().with_return_materializers(ImageCustomerMaterializ │
│ 19 │ │ train_model(), │
│ 20 │ ) │
│ ❱ 21 │ image_seg_pipe.run() │
│ 22 │
│ 23 │
│ 24 if __name__ == "__main__": │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/zenml/pipelines/base_pipeline.py: │
│ 500 in run │
│ │
│ 497 │ │ self._reset_step_flags() │
│ 498 │ │ self.validate_stack(stack) │
│ 499 │ │ │
│ ❱ 500 │ │ return stack.deploy_pipeline( │
│ 501 │ │ │ self, runtime_configuration=runtime_configuration │
│ 502 │ │ ) │
│ 503 │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/zenml/stack/stack.py:615 in │
│ deploy_pipeline │
│ │
│ 612 │ │ │ pipeline=pipeline, runtime_configuration=runtime_configuration │
│ 613 │ │ ) │
│ 614 │ │ │
│ ❱ 615 │ │ return_value = self.orchestrator.run( │
│ 616 │ │ │ pipeline, stack=self, runtime_configuration=runtime_configuration │
│ 617 │ │ ) │
│ 618 │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/zenml/orchestrators/base_orchestr │
│ ator.py:262 in run │
│ │
│ 259 │ │ │ pipeline=pipeline, pb2_pipeline=pb2_pipeline │
│ 260 │ │ ) │
│ 261 │ │ │
│ ❱ 262 │ │ result = self.prepare_or_run_pipeline( │
│ 263 │ │ │ sorted_steps=sorted_steps, │
│ 264 │ │ │ pipeline=pipeline, │
│ 265 │ │ │ pb2_pipeline=pb2_pipeline, │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/zenml/orchestrators/local/local_o │
│ rchestrator.py:68 in prepare_or_run_pipeline │
│ │
│ 65 │ │ │
│ 66 │ │ # Run each step │
│ 67 │ │ for step in sorted_steps: │
│ ❱ 68 │ │ │ self.run_step( │
│ 69 │ │ │ │ step=step, │
│ 70 │ │ │ │ run_name=runtime_configuration.run_name, │
│ 71 │ │ │ │ pb2_pipeline=pb2_pipeline, │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/zenml/orchestrators/base_orchestr │
│ ator.py:366 in run_step │
│ │
│ 363 │ │ # This is where the step actually gets executed using the │
│ 364 │ │ # component_launcher │
│ 365 │ │ repo.active_stack.prepare_step_run() │
│ ❱ 366 │ │ execution_info = self._execute_step(component_launcher) │
│ 367 │ │ repo.active_stack.cleanup_step_run() │
│ 368 │ │ │
│ 369 │ │ return execution_info │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/zenml/orchestrators/base_orchestr │
│ ator.py:390 in _execute_step │
│ │
│ 387 │ │ start_time = time.time() │
│ 388 │ │ logger.info(f"Step `{pipeline_step_name}` has started.") │
│ 389 │ │ try: │
│ ❱ 390 │ │ │ execution_info = tfx_launcher.launch() │
│ 391 │ │ │ if execution_info and get_cache_status(execution_info): │
│ 392 │ │ │ │ logger.info(f"Using cached version of `{pipeline_step_name}`.") │
│ 393 │ │ except RuntimeError as e: │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/tfx/orchestration/portable/launch │
│ er.py:549 in launch │
│ │
│ 546 │ │ self._executor_operator.with_execution_watcher( │
│ 547 │ │ │ executor_watcher.address) │
│ 548 │ │ executor_watcher.start() │
│ ❱ 549 │ │ executor_output = self._run_executor(execution_info) │
│ 550 │ except Exception as e: # pylint: disable=broad-except │
│ 551 │ │ execution_output = ( │
│ 552 │ │ │ e.executor_output if isinstance(e, _ExecutionFailedError) else None) │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/tfx/orchestration/portable/launch │
│ er.py:424 in _run_executor │
│ │
│ 421 │ │
│ 422 │ outputs_utils.make_output_dirs(execution_info.output_dict) │
│ 423 │ try: │
│ ❱ 424 │ executor_output = self._executor_operator.run_executor(execution_info) │
│ 425 │ code = executor_output.execution_result.code │
│ 426 │ if code != 0: │
│ 427 │ │ result_message = executor_output.execution_result.result_message │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/zenml/step_operators/step_executo │
│ r_operator.py:280 in run_executor │
│ │
│ 277 │ │ │ requirements, │
│ 278 │ │ │ entrypoint_command, │
│ 279 │ │ ) │
│ ❱ 280 │ │ step_operator.launch( │
│ 281 │ │ │ pipeline_name=execution_info.pipeline_info.id, │
│ 282 │ │ │ run_name=execution_info.pipeline_run_id, │
│ 283 │ │ │ requirements=requirements, │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/zenml/integrations/aws/step_opera │
│ tors/sagemaker_step_operator.py:146 in launch │
│ │
│ 143 │ │ │ │ "TrialName": sanitized_run_name, │
│ 144 │ │ │ } │
│ 145 │ │ │
│ ❱ 146 │ │ estimator.fit( │
│ 147 │ │ │ wait=True, │
│ 148 │ │ │ experiment_config=experiment_config, │
│ 149 │ │ │ job_name=sanitized_run_name, │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/sagemaker/estimator.py:952 in fit │
│ │
│ 949 │ │ """ │
│ 950 │ │ self._prepare_for_training(job_name=job_name) │
│ 951 │ │ │
│ ❱ 952 │ │ self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_confi │
│ 953 │ │ self.jobs.append(self.latest_training_job) │
│ 954 │ │ if wait: │
│ 955 │ │ │ self.latest_training_job.wait(logs=logs) │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/sagemaker/estimator.py:1770 in │
│ start_new │
│ │
│ 1767 │ │ │ all information about the started training job. │
│ 1768 │ │ """ │
│ 1769 │ │ train_args = cls._get_train_args(estimator, inputs, experiment_config) │
│ ❱ 1770 │ │ estimator.sagemaker_session.train(**train_args) │
│ 1771 │ │ │
│ 1772 │ │ return cls(estimator.sagemaker_session, estimator._current_job_name) │
│ 1773 │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/sagemaker/session.py:590 in train │
│ │
│ 587 │ │ ) │
│ 588 │ │ LOGGER.info("Creating training-job with name: %s", job_name) │
│ 589 │ │ LOGGER.debug("train request: %s", json.dumps(train_request, indent=4)) │
│ ❱ 590 │ │ self.sagemaker_client.create_training_job(**train_request) │
│ 591 │ │
│ 592 │ def _get_train_request( # noqa: C901 │
│ 593 │ │ self, │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/botocore/client.py:395 in │
│ _api_call │
│ │
│ 392 │ │ │ │ raise TypeError( │
│ 393 │ │ │ │ │ "%s() only accepts keyword arguments." % py_operation_name) │
│ 394 │ │ │ # The "self" in this scope is referring to the BaseClient. │
│ ❱ 395 │ │ │ return self._make_api_call(operation_name, kwargs) │
│ 396 │ │ │
│ 397 │ │ _api_call.__name__ = str(py_operation_name) │
│ 398 │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/botocore/client.py:725 in │
│ _make_api_call │
│ │
│ 722 │ │ if http.status_code >= 300: │
│ 723 │ │ │ error_code = parsed_response.get("Error", {}).get("Code") │
│ 724 │ │ │ error_class = self.exceptions.from_code(error_code) │
│ ❱ 725 │ │ │ raise error_class(parsed_response, operation_name) │
│ 726 │ │ else: │
│ 727 │ │ │ return parsed_response │
│ 728 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
Code of Conduct
- [X] I agree to follow this project's Code of Conduct