sagemaker-python-sdk
amazon-sagemaker-bert-pytorch failed to run
I cloned https://github.com/aws-samples/amazon-sagemaker-bert-pytorch.git in SageMaker and ran the Jupyter notebook without any modification, and got the following error:
"UnexpectedStatusException: Error for Training job pytorch-training-2020-10-27-16-28-37-955: Failed. Reason: AlgorithmError: ExecuteUserScriptError: Command "/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2".
Hi,
I’ve also been hitting the same kind of problem since last Monday (just this week). I can’t run PyTorch code on a SageMaker instance (ml.p3.8xlarge, and local mode). However, I could run and complete training using the same code last Friday.
I got AlgorithmError: ExecuteUserScriptError, but my algorithm log (in CloudWatch) only shows tqdm output, with the progress-bar characters garbled:

|██████████| 207659/210805 [03:27<00:02, 1050.67it/s]
 99%|█████████▉| 210753/210805 [03:30<00:00, 1048.70it/s]
100%|██████████| 210805/210805 [03:30<00:00, 999.88it/s]

I would like to know of any workaround. Thank you very much for your help.
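As a side note on the garbled bars themselves: CloudWatch mangles tqdm's Unicode block characters and carriage-return redraws, so a common workaround is to force plain-ASCII output and throttle the refresh rate in the training script. A minimal sketch, not specific to this notebook:

```python
from tqdm import tqdm

# ascii=True draws the bar with "#" instead of Unicode blocks, which
# CloudWatch renders cleanly; mininterval throttles redraws so the log
# is not flooded with carriage-return frames.
for batch in tqdm(range(1000), ascii=True, mininterval=5.0):
    pass  # training step would go here
```

This does not fix the underlying AlgorithmError, but it makes the CloudWatch log readable enough to find the real traceback.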
Hi,
I have the same issue as above. Code that worked perfectly two weeks ago, and in which I changed nothing, now does not run.
There is something seriously wrong with SageMaker's PyTorch estimator. This should be looked at.
Thank you
Ran into the same issue, but was able to get the training job to run using PyTorch framework_version 1.6.0 and updating to transformers==3.5.0. But when trying to deploy an endpoint:
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")
I get this error:
---------------------------------------------------------------------------
UnexpectedStatusException Traceback (most recent call last)
<ipython-input-14-0037ac51261d> in <module>
----> 1 predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, use_compiled_model, wait, model_name, kms_key, data_capture_config, tags, **kwargs)
801 wait=wait,
802 kms_key=kms_key,
--> 803 data_capture_config=data_capture_config,
804 )
805
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, **kwargs)
528 kms_key=kms_key,
529 wait=wait,
--> 530 data_capture_config_dict=data_capture_config_dict,
531 )
532
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, kms_key, wait, data_capture_config_dict)
3191
3192 self.sagemaker_client.create_endpoint_config(**config_options)
-> 3193 return self.create_endpoint(endpoint_name=name, config_name=name, tags=tags, wait=wait)
3194
3195 def expand_role(self, role):
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, tags, wait)
2709 )
2710 if wait:
-> 2711 self.wait_for_endpoint(endpoint_name)
2712 return endpoint_name
2713
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in wait_for_endpoint(self, endpoint, poll)
2978 ),
2979 allowed_statuses=["InService"],
-> 2980 actual_status=status,
2981 )
2982 return desc
UnexpectedStatusException: Error hosting endpoint pytorch-training-2020-11-13-22-51-37-707: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.
Pulling up the cloudwatch logs, I see:
Traceback (most recent call last):
File "/opt/conda/bin/torch-model-archiver", line 10, in <module>
sys.exit(generate_model_archive())
File "/opt/conda/lib/python3.6/site-packages/model_archiver/model_packaging.py", line 60, in generate_model_archive
package_model(args, manifest=manifest)
File "/opt/conda/lib/python3.6/site-packages/model_archiver/model_packaging.py", line 37, in package_model
model_path = ModelExportUtils.copy_artifacts(model_name, **artifact_files)
File "/opt/conda/lib/python3.6/site-packages/model_archiver/model_packaging_utils.py", line 150, in copy_artifacts
shutil.copy(path, model_path)
File "/opt/conda/lib/python3.6/shutil.py", line 245, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/opt/conda/lib/python3.6/shutil.py", line 120, in copyfile
with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/model/model.pth'
Did something change in the PyTorch image?
Updating the requirements.txt to
tqdm
requests==2.22.0
regex
sacremoses
sentencepiece==0.1.91
transformers==3.5.0
and using the PyTorch 1.5.0 image worked for me.
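For context, wiring those pins into the training job looks roughly like the sketch below. The entry point and hyperparameters are the notebook's own; the role and source_dir values are placeholders for your setup. SageMaker script mode installs requirements.txt from source_dir before launching the entry point, which is what pulls in the pinned transformers version:

```python
from sagemaker.pytorch import PyTorch

# A sketch of the working configuration described above: the 1.5.0
# PyTorch training image plus a source_dir whose requirements.txt pins
# transformers. Role and source_dir are placeholders.
estimator = PyTorch(
    entry_point="train_deploy.py",
    source_dir="code",  # must contain the requirements.txt shown above
    role="<your-sagemaker-execution-role>",
    framework_version="1.5.0",
    py_version="py3",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    hyperparameters={"epochs": 1, "num_labels": 2, "backend": "gloo"},
)
```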
I am getting the exact same issue; PyTorch 1.5.0 with the above requirements does not work for me, nor does 1.6.0 with the following requirements:
pycm
tqdm
requests==2.22.0
regex
sacremoses
sentencepiece==0.1.91
# transformers==3.5.0  (does not work with PyTorch 1.5 or 1.6)
transformers==4.0.0
The sagemaker version in the notebook is as follows:
sagemaker==2.18.0
sagemaker-pyspark==1.4.1
This change to switch to TorchServe broke some workflows where the training job doesn't save the model to /opt/ml/model/model.pth; see https://github.com/aws/sagemaker-pytorch-inference-toolkit/commit/a3a08d04a0cc4b0cf7074a7839d7e89895cd092d. The change is released in the 1.6 PyTorch container. The solution of providing a custom save_pytorch_model method, mentioned in https://github.com/aws/sagemaker-pytorch-inference-toolkit/issues/86, should work. Another workaround is to switch to 1.5 or older PyTorch images.
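Concretely, the workaround amounts to making the training script write the weights to the exact filename the 1.6+ TorchServe-based container archives. A minimal sketch, assuming the 1.6 container expects /opt/ml/model/model.pth as described above (save_model here is this snippet's own helper, not an SDK API):

```python
import os

import torch


def save_model(model, model_dir):
    """Save weights where the PT 1.6+ TorchServe container looks for them.

    model_dir is /opt/ml/model inside the training container, so this
    writes the /opt/ml/model/model.pth the archiver complained about.
    """
    path = os.path.join(model_dir, "model.pth")
    # Saving the state_dict keeps the artifact loadable later with
    # model.load_state_dict(torch.load(path)).
    torch.save(model.state_dict(), path)
    return path
```

In the notebook's train_deploy.py this would be called at the end of training with the SM_MODEL_DIR environment variable as model_dir.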
Hi, still facing the same error
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages
Invoking script with the following command:
/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2
2022-01-16 17:48:11,890 sagemaker-containers ERROR ExecuteUserScriptError:
Command "/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2"
Traceback (most recent call last):
  File "train_deploy.py", line 14, in <module>
    from transformers import AdamW, BertForSequenceClassification, BertTokenizer
ModuleNotFoundError: No module named 'transformers'
2022-01-16 17:48:21 Uploading - Uploading generated training model
2022-01-16 17:48:21 Failed - Training job failed
UnexpectedStatusException                 Traceback (most recent call last)
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    690         self.jobs.append(self.latest_training_job)
    691         if wait:
--> 692             self.latest_training_job.wait(logs=logs)
    693
    694     def _compilation_job_name(self):
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1653         # If logs are requested, call logs_for_jobs.
   1654         if logs != "None":
--> 1655             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1656         else:
   1657             self.sagemaker_session.wait_for_job(self.job_name)
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3777
   3778         if wait:
--> 3779             self._check_job_status(job_name, description, "TrainingJobStatus")
   3780             if dot:
   3781                 print()
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3336             ),
   3337             allowed_statuses=["Completed", "Stopped"],
--> 3338             actual_status=status,
   3339         )
   3340
UnexpectedStatusException: Error for Training job pytorch-training-2022-01-16-17-43-52-825: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2"
Traceback (most recent call last):
  File "train_deploy.py", line 14, in <module>
The issue is not yet addressed. I am trying to deploy a custom DETR model (based on the HuggingFace implementation, trained with PyTorch for custom object detection). I am using PyTorch v1.11 and Python 3.8. The weights are structured the same way as stated in the AWS PyTorch documentation:
https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#for-versions-1-2-and-higher
The deployment is performed on an ml.m4.xlarge, and the error I get is that the weights are not found:
W-9000-model_1.0-stdout MODEL_LOG - FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/model/model.pt'
Could you please clarify where the model weights are expected to be? Or please provide the container Dockerfile so we can check.
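On the layout question: per the documentation linked above, PyTorch 1.2+ inference containers expect a model.tar.gz with the weights at the archive root (extracted to /opt/ml/model/) and the inference code under code/. A minimal packaging sketch (package_model is this snippet's own helper; the model.pt name matches the file the container complained about):

```python
import tarfile


def package_model(weights_path, code_dir, out_path="model.tar.gz"):
    """Build a model.tar.gz in the layout the PT >= 1.2 container expects.

    Archive contents:
        model.pt            -> extracted to /opt/ml/model/model.pt
        code/inference.py   -> model_fn / input_fn / predict_fn / output_fn
        code/requirements.txt (optional extra dependencies)
    """
    with tarfile.open(out_path, "w:gz") as tar:
        # Weights go at the archive root under the expected filename.
        tar.add(weights_path, arcname="model.pt")
        # The whole code directory is archived under code/.
        tar.add(code_dir, arcname="code")
    return out_path
```

If your archive nests the weights in a subdirectory, or names them differently without a model_fn that loads them explicitly, the container's default loader will raise exactly the FileNotFoundError shown above.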
Thanks for raising this issue. Can anyone confirm whether this issue still persists on the latest SageMaker?