sagemaker-python-sdk
Local Mode Endpoints can't be created from hosted training jobs
System Information
- Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow
- Framework Version: 1.13
- Python Version: 3.x
- CPU or GPU: GPU
- Python SDK Version: latest
- Are you using a custom image: no, Script Mode for TensorFlow
Describe the problem
Although I can create a Local Mode endpoint from a Local Mode training job, I cannot create one from a hosted training job. See error log below.
Minimal repro / logs
ClientError Traceback (most recent call last)
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, use_compiled_model, update_endpoint, wait, **kwargs)
    448     update_endpoint=update_endpoint,
    449     tags=self.tags,
--> 450     wait=wait,
    451 )
    452

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, update_endpoint, tags, kms_key, wait)
    385 else:
    386     self.sagemaker_session.endpoint_from_production_variants(
--> 387         self.endpoint_name, [production_variant], tags, kms_key, wait
    388     )
    389

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, kms_key, wait)
   1216     config_options["KmsKeyId"] = kms_key
   1217
-> 1218 self.sagemaker_client.create_endpoint_config(**config_options)
   1219 return self.create_endpoint(endpoint_name=name, config_name=name, tags=tags, wait=wait)
   1220

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    355     "%s() only accepts keyword arguments." % py_operation_name)
    356 # The "self" in this scope is referring to the BaseClient.
--> 357 return self._make_api_call(operation_name, kwargs)
    358
    359 _api_call.name = str(py_operation_name)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    659     error_code = parsed_response.get("Error", {}).get("Code")
    660     error_class = self.exceptions.from_code(error_code)
--> 661     raise error_class(parsed_response, operation_name)
    662 else:
    663     return parsed_response

ClientError: An error occurred (ValidationException) when calling the CreateEndpointConfig operation: 1 validation error detected: Value 'local_gpu' at 'productionVariants.1.member.instanceType' failed to satisfy constraint: Member must satisfy enum value set: [ml.r5.12xlarge, ml.p2.xlarge, ml.m5.4xlarge, ml.m4.16xlarge, ml.r5.24xlarge, ml.p3.16xlarge, ml.m5.large, ml.t2.xlarge, ml.p2.16xlarge, ml.c4.2xlarge, ml.c5.2xlarge, ml.c4.4xlarge, ml.c5.4xlarge, ml.g4dn.xlarge, ml.g4dn.12xlarge, ml.g4dn.2xlarge, ml.c4.8xlarge, ml.c4.large, ml.c5.large, ml.g4dn.4xlarge, ml.c5.9xlarge, ml.g4dn.16xlarge, ml.c5.xlarge, ml.c4.xlarge, ml.g4dn.8xlarge, ml.t2.2xlarge, ml.t2.medium, ml.c5.18xlarge, ml.r5.2xlarge, ml.p3.2xlarge, ml.m5.xlarge, ml.m4.10xlarge, ml.t2.large, ml.r5.4xlarge, ml.m5.12xlarge, ml.m4.xlarge, ml.m5.24xlarge, ml.m4.2xlarge, ml.p2.8xlarge, ml.m5.2xlarge, ml.r5.xlarge, ml.r5.large, ml.p3.8xlarge, ml.m4.4xlarge]
- Exact command to reproduce:
predictor = estimator.deploy(initial_instance_count=1,instance_type='local_gpu')
Hello @rabowskyb,
Thanks for reporting this issue.
Can you provide your Estimator instantiation and runner code?
I'm wondering if a Session object is being passed in the constructor.
Yes, my code is at: https://github.com/aws-samples/amazon-sagemaker-script-mode/blob/master/keras-embeddings-script-mode/keras-embeddings.ipynb
The line of code that produces the error is under "SageMaker hosted endpoint" when I change predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.xlarge') to predictor = estimator.deploy(initial_instance_count=1,instance_type='local_gpu')
Understood.
Looking at the Estimator constructor, the sagemaker_session object is instantiated depending on the train_instance_type provided. In this case it ends up being a Session object meant to work with SageMaker and not the LocalSession that works with local mode. https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/estimator.py#L156
The deploy method looks at the specific Session type and acts accordingly. https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/model.py#L351
I think the right thing to do is to probably have the sagemaker_session be determined by the instance_type arg in the deploy call.
I'll try to make this change and test it against your notebook and then submit a PR if things go smoothly.
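To make the proposed change concrete, here is a minimal sketch of choosing the session from the instance_type passed to deploy rather than from train_instance_type. It is not the SDK's code: HostedSessionStub, LocalSessionStub and session_for are hypothetical stand-ins for Session, LocalSession and whatever logic the fix would actually add, used so the example runs without the SDK installed.

```python
from dataclasses import dataclass

# Instance type names that select Local Mode (taken from this thread).
LOCAL_INSTANCE_TYPES = ("local", "local_gpu")


@dataclass
class HostedSessionStub:
    """Hypothetical stand-in for the Session that talks to hosted SageMaker."""
    kind: str = "hosted"


@dataclass
class LocalSessionStub:
    """Hypothetical stand-in for the LocalSession used by Local Mode."""
    kind: str = "local"


def session_for(instance_type, current_session=None):
    """Return a session that matches the instance type given to deploy().

    The current session is reused only when it already matches the
    requested mode; otherwise a new session of the right kind is created.
    """
    wants_local = instance_type in LOCAL_INSTANCE_TYPES
    has_local = isinstance(current_session, LocalSessionStub)

    if wants_local:
        return current_session if has_local else LocalSessionStub()
    if current_session is not None and not has_local:
        return current_session
    return HostedSessionStub()


# An estimator trained on hosted instances carries a hosted session,
# but deploying to 'local_gpu' should switch to a local one.
hosted = HostedSessionStub()
print(session_for("local_gpu", hosted).kind)
print(session_for("ml.m5.xlarge", hosted) is hosted)
```

With logic along these lines, a deploy call with instance_type='local_gpu' would never reach CreateEndpointConfig, which is where the ValidationException above is raised.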
Hello, just checking in on how this is going. There is definitely a lot of interest in this because it would save a lot of time during prototyping.
@rabowskyb
Thanks for bringing this to our attention. While we work on an official fix, I think there is a workaround you can try: set estimator.sagemaker_session to None before calling deploy.
@icywang86rui I tried setting estimator.sagemaker_session to None before calling deploy. It results in another error: 'NoneType' object has no attribute 'sagemaker_client'. The notebook is https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb
Did you ever get to the bottom of this? I can't get this to work...
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri

training_image = get_image_uri(boto3.Session().region_name, 'object-detection', repo_version='latest')

training_params = {
"AlgorithmSpecification": {
"TrainingImage": training_image,
"TrainingInputMode": "Pipe"
},
"RoleArn": role,
"OutputDataConfig": {
"S3OutputPath": s3_output_path
},
"ResourceConfig": {
"InstanceCount": 1,
"InstanceType": "local_gpu",
"VolumeSizeInGB": 200
},
"TrainingJobName": model_job_name,
"HyperParameters": hyperparams,
"StoppingCondition": {
"MaxRuntimeInSeconds": 86400
},
"InputDataConfig": [
train_input,
validation_input
]
}

from sagemaker.local import LocalSession
sagemaker_session = LocalSession()

client = boto3.client(service_name='sagemaker')
client.create_training_job(**training_params)
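
One thing to note about the snippet above: boto3's create_training_job calls the hosted CreateTrainingJob API directly, while Local Mode is a feature of the SageMaker Python SDK. A request with "InstanceType": "local_gpu" is therefore presumably rejected in the same way as the CreateEndpointConfig call earlier in this thread, and the LocalSession created beforehand is never used by the boto3 client. A small pre-flight check, sketched below, would surface this before the request is sent. The helper name assert_hosted_instance_type is hypothetical and is not part of boto3 or the SDK.

```python
# Instance type names that only the SageMaker Python SDK's Local Mode understands.
LOCAL_INSTANCE_TYPES = ("local", "local_gpu")


def assert_hosted_instance_type(training_params):
    """Raise if a low-level training request uses a Local Mode instance type.

    `training_params` is the dict that would be passed to
    client.create_training_job(**training_params).
    """
    instance_type = training_params["ResourceConfig"]["InstanceType"]
    if instance_type in LOCAL_INSTANCE_TYPES:
        raise ValueError(
            "%r is a Local Mode instance type; use an SDK estimator with a "
            "local session instead of the boto3 client" % instance_type
        )
    return instance_type


# Trimmed-down request: only the field the check inspects.
params = {"ResourceConfig": {"InstanceCount": 1, "InstanceType": "local_gpu"}}
try:
    assert_hosted_instance_type(params)
except ValueError as err:
    print(err)
```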
Hi @rabowskyb - does this issue still persist with the latest sagemaker?
Closing the issue because there has been no response for a long time. Feel free to reopen if the issue is still there.