[BUG]: ResourceLimitExceeded when running a pipeline on AWS SageMaker
Contact Details [Optional]
No response
System Information
ZenML version: 0.11.0
Install path: /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/zenml
Python version: 3.8.13
Platform information: {'os': 'linux', 'linux_distro': 'ubuntu', 'linux_distro_like': 'debian', 'linux_distro_version': '20.04'}
Environment: native
Integrations: ['aws', 'lightgbm', 'mlflow', 'plotly', 'pytorch', 's3', 'scipy', 'sklearn', 'wandb', 'xgboost']
What happened?
I tried running a pipeline on the AWS SageMaker stack and got the following error:

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p2.xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.

My region is Asia Pacific (Singapore), ap-southeast-1.
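For reference, this is roughly how the current limit could be inspected and an increase requested programmatically through the AWS Service Quotas API, instead of going through support manually. This is a minimal sketch, not something ZenML does itself; it assumes the quota name matches the one in the error message and that the credentials have `servicequotas` permissions:

```python
import boto3

REGION = "ap-southeast-1"
QUOTA_NAME = "ml.p2.xlarge for training job usage"  # name taken from the error above

client = boto3.client("service-quotas", region_name=REGION)

# Walk the paginated list of SageMaker quotas to find the one from the error.
quota, token = None, None
while quota is None:
    kwargs = {"ServiceCode": "sagemaker"}
    if token:
        kwargs["NextToken"] = token
    page = client.list_service_quotas(**kwargs)
    quota = next((q for q in page["Quotas"] if q["QuotaName"] == QUOTA_NAME), None)
    token = page.get("NextToken")
    if quota is None and not token:
        raise RuntimeError(f"Quota '{QUOTA_NAME}' not found in {REGION}")

print(f"Current limit: {quota['Value']} instances")

# Ask AWS to raise the limit to 1 instance; the request may still be
# reviewed by AWS support before it takes effect.
if quota["Value"] < 1:
    client.request_service_quota_increase(
        ServiceCode="sagemaker",
        QuotaCode=quota["QuotaCode"],
        DesiredValue=1.0,
    )
```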
Reproduction steps
No response
Relevant log output
(zenfiles) dnth@dnth:~/Desktop/zenfiles/image-segmentation$ python run_image_seg_pipeline.py
Creating run for pipeline: image_segmentation_pipeline
Cache enabled for pipeline image_segmentation_pipeline
/home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/xgboost/compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
from pandas import MultiIndex, Int64Index
Using stack sagemaker_stack_with_wandb to run pipeline image_segmentation_pipeline...
Step apply_augmentations has started.
Using cached version of apply_augmentations.
Step apply_augmentations has finished in 0.025s.
Step initiate_model_and_optimizer has started.
Using cached version of initiate_model_and_optimizer.
Step initiate_model_and_optimizer has finished in 0.021s.
Step prepare_df has started.
Using cached version of prepare_df.
Step prepare_df has finished in 0.022s.
Step create_stratified_fold has started.
Using cached version of create_stratified_fold.
Step create_stratified_fold has finished in 0.035s.
Step prepare_dataloaders has started.
Using cached version of prepare_dataloaders.
Step prepare_dataloaders has finished in 0.047s.
Step train_model has started.
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
Using step operator sagemaker to run step train_model.
Using dockerignore found at path '/home/dnth/Desktop/zenfiles/image-segmentation/.dockerignore' to create docker build context.
Building docker image '715803424590.dkr.ecr.ap-southeast-1.amazonaws.com/zenml-sagemaker:image_segmentation_pipeline', this might take a while...
Step 1/7 : FROM zenmldocker/zenml:0.11.0-py3.8
---> 79d1edfc393e
Step 2/7 : WORKDIR /app
---> Using cache
---> 6be63bda03c2
Step 3/7 : RUN pip install --no-cache 'Pillow>=9.1.0' 'boto3==1.21.21' 's3fs==2022.3.0' 'sagemaker==2.82.2' 'wandb>=0.12.12' 'zenml==0.11.0'
---> Using cache
---> fd4bb1181c1a
Step 4/7 : COPY . .
---> 3e8d71133573
Step 5/7 : RUN chmod -R a+rw .
---> Running in d275c22215d0
---> 409a3f32b150
Step 6/7 : ENV ZENML_CONFIG_PATH=/app/.zenconfig
---> Running in 59331c5cfa79
---> 6c521aa88414
Step 7/7 : ENTRYPOINT python -m zenml.step_operators.entrypoint --main_module run_image_seg_pipeline --step_source_path steps.model_steps.train_model --execution_info_path s3://zenfile-bucket/train_model/.system/executor_execution/107/.temp/zenml_execution_info.pb --input_artifact_types_path s3://zenfile-bucket/train_model/.system/executor_execution/107/.temp/input_artifacts.json
---> Running in 9d6fbf68eddb
---> b417c7fbbe9e
Successfully built b417c7fbbe9e
Successfully tagged 715803424590.dkr.ecr.ap-southeast-1.amazonaws.com/zenml-sagemaker:image_segmentation_pipeline
Finished building docker image.
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
Pushing docker image '715803424590.dkr.ecr.ap-southeast-1.amazonaws.com/zenml-sagemaker:image_segmentation_pipeline'.
Finished pushing docker image.
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: image-segmentation-pipeline-15-Aug-22-21-05-39-318118
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/dnth/Desktop/zenfiles/image-segmentation/run_image_seg_pipeline.py:25 in <module> │
│ │
│ 22 │
│ 23 │
│ 24 if __name__ == "__main__": │
│ ❱ 25 │ run_img_seg_pipe() │
│ 26 │
│ │
│ /home/dnth/Desktop/zenfiles/image-segmentation/run_image_seg_pipeline.py:21 in run_img_seg_pipe │
│ │
│ 18 │ │ initiate_model_and_optimizer().with_return_materializers(ImageCustomerMaterializ │
│ 19 │ │ train_model(), │
│ 20 │ ) │
│ ❱ 21 │ image_seg_pipe.run() │
│ 22 │
│ 23 │
│ 24 if __name__ == "__main__": │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/zenml/pipelines/base_pipeline.py: │
│ 500 in run │
│ │
│ 497 │ │ self._reset_step_flags() │
│ 498 │ │ self.validate_stack(stack) │
│ 499 │ │ │
│ ❱ 500 │ │ return stack.deploy_pipeline( │
│ 501 │ │ │ self, runtime_configuration=runtime_configuration │
│ 502 │ │ ) │
│ 503 │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/zenml/stack/stack.py:615 in │
│ deploy_pipeline │
│ │
│ 612 │ │ │ pipeline=pipeline, runtime_configuration=runtime_configuration │
│ 613 │ │ ) │
│ 614 │ │ │
│ ❱ 615 │ │ return_value = self.orchestrator.run( │
│ 616 │ │ │ pipeline, stack=self, runtime_configuration=runtime_configuration │
│ 617 │ │ ) │
│ 618 │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/zenml/orchestrators/base_orchestr │
│ ator.py:262 in run │
│ │
│ 259 │ │ │ pipeline=pipeline, pb2_pipeline=pb2_pipeline │
│ 260 │ │ ) │
│ 261 │ │ │
│ ❱ 262 │ │ result = self.prepare_or_run_pipeline( │
│ 263 │ │ │ sorted_steps=sorted_steps, │
│ 264 │ │ │ pipeline=pipeline, │
│ 265 │ │ │ pb2_pipeline=pb2_pipeline, │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/zenml/orchestrators/local/local_o │
│ rchestrator.py:68 in prepare_or_run_pipeline │
│ │
│ 65 │ │ │
│ 66 │ │ # Run each step │
│ 67 │ │ for step in sorted_steps: │
│ ❱ 68 │ │ │ self.run_step( │
│ 69 │ │ │ │ step=step, │
│ 70 │ │ │ │ run_name=runtime_configuration.run_name, │
│ 71 │ │ │ │ pb2_pipeline=pb2_pipeline, │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/zenml/orchestrators/base_orchestr │
│ ator.py:366 in run_step │
│ │
│ 363 │ │ # This is where the step actually gets executed using the │
│ 364 │ │ # component_launcher │
│ 365 │ │ repo.active_stack.prepare_step_run() │
│ ❱ 366 │ │ execution_info = self._execute_step(component_launcher) │
│ 367 │ │ repo.active_stack.cleanup_step_run() │
│ 368 │ │ │
│ 369 │ │ return execution_info │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/zenml/orchestrators/base_orchestr │
│ ator.py:390 in _execute_step │
│ │
│ 387 │ │ start_time = time.time() │
│ 388 │ │ logger.info(f"Step `{pipeline_step_name}` has started.") │
│ 389 │ │ try: │
│ ❱ 390 │ │ │ execution_info = tfx_launcher.launch() │
│ 391 │ │ │ if execution_info and get_cache_status(execution_info): │
│ 392 │ │ │ │ logger.info(f"Using cached version of `{pipeline_step_name}`.") │
│ 393 │ │ except RuntimeError as e: │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/tfx/orchestration/portable/launch │
│ er.py:549 in launch │
│ │
│ 546 │ │ self._executor_operator.with_execution_watcher( │
│ 547 │ │ │ executor_watcher.address) │
│ 548 │ │ executor_watcher.start() │
│ ❱ 549 │ │ executor_output = self._run_executor(execution_info) │
│ 550 │ except Exception as e: # pylint: disable=broad-except │
│ 551 │ │ execution_output = ( │
│ 552 │ │ │ e.executor_output if isinstance(e, _ExecutionFailedError) else None) │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/tfx/orchestration/portable/launch │
│ er.py:424 in _run_executor │
│ │
│ 421 │ │
│ 422 │ outputs_utils.make_output_dirs(execution_info.output_dict) │
│ 423 │ try: │
│ ❱ 424 │ executor_output = self._executor_operator.run_executor(execution_info) │
│ 425 │ code = executor_output.execution_result.code │
│ 426 │ if code != 0: │
│ 427 │ │ result_message = executor_output.execution_result.result_message │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/zenml/step_operators/step_executo │
│ r_operator.py:280 in run_executor │
│ │
│ 277 │ │ │ requirements, │
│ 278 │ │ │ entrypoint_command, │
│ 279 │ │ ) │
│ ❱ 280 │ │ step_operator.launch( │
│ 281 │ │ │ pipeline_name=execution_info.pipeline_info.id, │
│ 282 │ │ │ run_name=execution_info.pipeline_run_id, │
│ 283 │ │ │ requirements=requirements, │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/zenml/integrations/aws/step_opera │
│ tors/sagemaker_step_operator.py:146 in launch │
│ │
│ 143 │ │ │ │ "TrialName": sanitized_run_name, │
│ 144 │ │ │ } │
│ 145 │ │ │
│ ❱ 146 │ │ estimator.fit( │
│ 147 │ │ │ wait=True, │
│ 148 │ │ │ experiment_config=experiment_config, │
│ 149 │ │ │ job_name=sanitized_run_name, │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/sagemaker/estimator.py:952 in fit │
│ │
│ 949 │ │ """ │
│ 950 │ │ self._prepare_for_training(job_name=job_name) │
│ 951 │ │ │
│ ❱ 952 │ │ self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_confi │
│ 953 │ │ self.jobs.append(self.latest_training_job) │
│ 954 │ │ if wait: │
│ 955 │ │ │ self.latest_training_job.wait(logs=logs) │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/sagemaker/estimator.py:1770 in │
│ start_new │
│ │
│ 1767 │ │ │ all information about the started training job. │
│ 1768 │ │ """ │
│ 1769 │ │ train_args = cls._get_train_args(estimator, inputs, experiment_config) │
│ ❱ 1770 │ │ estimator.sagemaker_session.train(**train_args) │
│ 1771 │ │ │
│ 1772 │ │ return cls(estimator.sagemaker_session, estimator._current_job_name) │
│ 1773 │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/sagemaker/session.py:590 in train │
│ │
│ 587 │ │ ) │
│ 588 │ │ LOGGER.info("Creating training-job with name: %s", job_name) │
│ 589 │ │ LOGGER.debug("train request: %s", json.dumps(train_request, indent=4)) │
│ ❱ 590 │ │ self.sagemaker_client.create_training_job(**train_request) │
│ 591 │ │
│ 592 │ def _get_train_request( # noqa: C901 │
│ 593 │ │ self, │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/botocore/client.py:395 in │
│ _api_call │
│ │
│ 392 │ │ │ │ raise TypeError( │
│ 393 │ │ │ │ │ "%s() only accepts keyword arguments." % py_operation_name) │
│ 394 │ │ │ # The "self" in this scope is referring to the BaseClient. │
│ ❱ 395 │ │ │ return self._make_api_call(operation_name, kwargs) │
│ 396 │ │ │
│ 397 │ │ _api_call.__name__ = str(py_operation_name) │
│ 398 │
│ │
│ /home/dnth/anaconda3/envs/zenfiles/lib/python3.8/site-packages/botocore/client.py:725 in │
│ _make_api_call │
│ │
│ 722 │ │ if http.status_code >= 300: │
│ 723 │ │ │ error_code = parsed_response.get("Error", {}).get("Code") │
│ 724 │ │ │ error_class = self.exceptions.from_code(error_code) │
│ ❱ 725 │ │ │ raise error_class(parsed_response, operation_name) │
│ 726 │ │ else: │
│ 727 │ │ │ return parsed_response │
│ 728 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
Code of Conduct
- [X] I agree to follow this project's Code of Conduct