Local Mode Endpoints can't be created from hosted training jobs

Open rabowskyb opened this issue 5 years ago • 8 comments

System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow
  • Framework Version: 1.13
  • Python Version: 3.x
  • CPU or GPU: GPU
  • Python SDK Version: latest
  • Are you using a custom image: no, Script Mode for TensorFlow

Describe the problem

Although I can create a Local Mode endpoint from a Local Mode training job, I cannot create one from a hosted training job. See error log below.

Minimal repro / logs

ClientError                               Traceback (most recent call last)
in ()
----> 1 predictor = estimator.deploy(initial_instance_count=1,instance_type='local_gpu')
      2 #predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.xlarge')

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, use_compiled_model, update_endpoint, wait, **kwargs)
    448             update_endpoint=update_endpoint,
    449             tags=self.tags,
--> 450             wait=wait,
    451         )
    452

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, update_endpoint, tags, kms_key, wait)
    385         else:
    386             self.sagemaker_session.endpoint_from_production_variants(
--> 387                 self.endpoint_name, [production_variant], tags, kms_key, wait
    388             )
    389

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, kms_key, wait)
   1216             config_options["KmsKeyId"] = kms_key
   1217
-> 1218         self.sagemaker_client.create_endpoint_config(**config_options)
   1219         return self.create_endpoint(endpoint_name=name, config_name=name, tags=tags, wait=wait)
   1220

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    355                     "%s() only accepts keyword arguments." % py_operation_name)
    356             # The "self" in this scope is referring to the BaseClient.
--> 357             return self._make_api_call(operation_name, kwargs)
    358
    359         _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    659             error_code = parsed_response.get("Error", {}).get("Code")
    660             error_class = self.exceptions.from_code(error_code)
--> 661             raise error_class(parsed_response, operation_name)
    662         else:
    663             return parsed_response

ClientError: An error occurred (ValidationException) when calling the CreateEndpointConfig operation: 1 validation error detected: Value 'local_gpu' at 'productionVariants.1.member.instanceType' failed to satisfy constraint: Member must satisfy enum value set: [ml.r5.12xlarge, ml.p2.xlarge, ml.m5.4xlarge, ml.m4.16xlarge, ml.r5.24xlarge, ml.p3.16xlarge, ml.m5.large, ml.t2.xlarge, ml.p2.16xlarge, ml.c4.2xlarge, ml.c5.2xlarge, ml.c4.4xlarge, ml.c5.4xlarge, ml.g4dn.xlarge, ml.g4dn.12xlarge, ml.g4dn.2xlarge, ml.c4.8xlarge, ml.c4.large, ml.c5.large, ml.g4dn.4xlarge, ml.c5.9xlarge, ml.g4dn.16xlarge, ml.c5.xlarge, ml.c4.xlarge, ml.g4dn.8xlarge, ml.t2.2xlarge, ml.t2.medium, ml.c5.18xlarge, ml.r5.2xlarge, ml.p3.2xlarge, ml.m5.xlarge, ml.m4.10xlarge, ml.t2.large, ml.r5.4xlarge, ml.m5.12xlarge, ml.m4.xlarge, ml.m5.24xlarge, ml.m4.2xlarge, ml.p2.8xlarge, ml.m5.2xlarge, ml.r5.xlarge, ml.r5.large, ml.p3.8xlarge, ml.m4.4xlarge]

  • Exact command to reproduce:

predictor = estimator.deploy(initial_instance_count=1,instance_type='local_gpu')

rabowskyb avatar Jul 13 '19 00:07 rabowskyb

Hello @rabowskyb,

Thanks for reporting this issue.

Can you provide your Estimator instantiation and runner code?

I'm wondering if a Session object is being passed in the constructor.

ChoiByungWook avatar Jul 16 '19 01:07 ChoiByungWook

Yes, my code is at: https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/keras-embeddings-script-mode/keras-embeddings.ipynb

The line of code that produces the error is under "SageMaker hosted endpoint" when I change predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.xlarge') to predictor = estimator.deploy(initial_instance_count=1,instance_type='local_gpu')

rabowskyb avatar Jul 16 '19 01:07 rabowskyb

Understood.

Looking at the Estimator constructor, the sagemaker_session object is instantiated depending on the train_instance_type provided. In this case it ends up being a Session object meant to work with SageMaker and not the LocalSession that works with local mode. https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/estimator.py#L156

The deploy method looks at the specific Session type and acts accordingly. https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/model.py#L351

I think the right fix is probably to have the sagemaker_session determined by the instance_type argument passed to the deploy call.
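A minimal sketch of that idea, not the SDK's actual implementation: the helper names below are hypothetical, and the set of Local Mode instance types ('local', 'local_gpu') is an assumption based on the values used in this thread. The session classes are imported lazily so the selection logic can be checked without the sagemaker package installed.

```python
# Hypothetical helpers for illustration only; none of these names exist in the SDK.
# Assumption: Local Mode is selected by exactly these two pseudo instance types.
LOCAL_INSTANCE_TYPES = {"local", "local_gpu"}


def is_local_instance_type(instance_type):
    """Return True when the instance type asks for Local Mode."""
    return instance_type in LOCAL_INSTANCE_TYPES


def session_kind_for_deploy(instance_type):
    """Name the session class that deploy() would need for this instance type."""
    return "LocalSession" if is_local_instance_type(instance_type) else "Session"


def make_session_for_deploy(instance_type):
    """Build the matching session object, importing sagemaker only when called."""
    if is_local_instance_type(instance_type):
        from sagemaker.local import LocalSession
        return LocalSession()
    from sagemaker.session import Session
    return Session()
```

With this approach, a deploy call with instance_type='local_gpu' would pick a LocalSession even when the estimator was trained on a hosted instance, instead of reusing the session chosen from train_instance_type.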

I'll try to make this change and test it against your notebook and then submit a PR if things go smoothly.

ChoiByungWook avatar Jul 16 '19 03:07 ChoiByungWook

Hello, just checking in on how this is going. There is definitely a lot of interest in this because it would save a lot of time during prototyping.

rabowskyb avatar Jul 18 '19 02:07 rabowskyb

@rabowskyb Thanks for bringing this to our attention. While we work on an official fix, I think there is a workaround you can try: set estimator.sagemaker_session to None before calling deploy.

icywang86rui avatar Aug 01 '19 17:08 icywang86rui

@icywang86rui I tried setting estimator.sagemaker_session to None before calling deploy. It results in another error: 'NoneType' object has no attribute 'sagemaker_client'. The notebook is https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb

hasanp87 avatar Sep 02 '20 12:09 hasanp87

Did you ever get to the bottom of this? I can't get this to work...

import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.local import LocalSession

# Look up the built-in object-detection training image for the current region.
training_image = get_image_uri(boto3.Session().region_name, 'object-detection', repo_version='latest')

training_params = {
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "Pipe"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": s3_output_path
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "local_gpu",
        "VolumeSizeInGB": 200
    },
    "TrainingJobName": model_job_name,
    "HyperParameters": hyperparams,
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 86400
    },
    "InputDataConfig": [
        train_input,
        validation_input
    ]
}

sagemaker_session = LocalSession()

# Note: the job is submitted through the plain boto3 client, not through sagemaker_session.
client = boto3.client(service_name='sagemaker')
client.create_training_job(**training_params)

xrstokes avatar Apr 16 '21 02:04 xrstokes

Hi @rabowskyb - does this issue still persist with the latest version of sagemaker?

akrishna1995 avatar Dec 27 '23 01:12 akrishna1995

Closing this issue because there has been no response for a long time. Feel free to reopen it if the issue is still there.

liujiaorr avatar Apr 28 '24 04:04 liujiaorr