
amazon-sagemaker-bert-pytorch failed to run

Open yzhang-github-pub opened this issue 5 years ago • 11 comments

I cloned https://github.com/aws-samples/amazon-sagemaker-bert-pytorch.git in SageMaker, ran the Jupyter notebook without any modification, and got the error below:

"UnexpectedStatusException: Error for Training job pytorch-training-2020-10-27-16-28-37-955: Failed. Reason: AlgorithmError: ExecuteUserScriptError: Command "/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2".

yzhang-github-pub · Oct 27 '20

Hi,

I've also had the same kind of problem since last Monday (just this week). I can't run PyTorch code on a SageMaker instance (ml.p3.8xlarge, and local mode). However, I could run and complete training using the same code last Friday.

I got AlgorithmError: ExecuteUserScriptError, but my algorithm log (I use CloudWatch) shows only tqdm progress output, with the carriage returns rendered as #015:

█████████▊| 207659/210805 [03:27<00:02, 1050.67it/s]
 99%|█████████▉| 210753/210805 [03:30<00:00, 1048.70it/s]
100%|██████████| 210805/210805 [03:30<00:00, 999.88it/s]

tqdm output these garbled characters.

I would like to know any workaround. Thank you very much for your help.
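One workaround sketch (an assumption on my part, not an official fix): CloudWatch is not a TTY, so tqdm's carriage-return redraws land in the log as #015. Replacing the bar with plain newline-terminated progress lines keeps the log readable; the log_progress helper below is hypothetical:

```python
import sys

def log_progress(iterable, total, every=10000, out=sys.stdout):
    """Yield items while printing newline-terminated progress lines.

    Unlike tqdm's carriage-return redraws (rendered as #015 in
    CloudWatch), each update here is a plain line that logs cleanly.
    """
    for i, item in enumerate(iterable, 1):
        if i % every == 0 or i == total:
            out.write("progress: %d/%d (%.0f%%)\n" % (i, total, 100.0 * i / total))
        yield item

# Example: consume a fake workload of 30000 steps, logging every 10000.
for _ in log_progress(range(30000), total=30000, every=10000):
    pass
```

If you want to keep tqdm itself, its disable and mininterval parameters can also quiet it down, e.g. tqdm(data, disable=not sys.stderr.isatty()).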

sugspi · Oct 29 '20

Hi,

I have the same issue as above. Code that worked perfectly two weeks ago, and that I have not changed since, no longer runs.

Something is seriously wrong with the SageMaker PyTorch estimator. This should be looked into.

Thank you

spacer730 · Oct 29 '20

Ran into the same issue, but was able to get the training job to run by using PyTorch framework_version="1.6.0" and pinning transformers==3.5.0. But when trying to deploy an endpoint:

predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

I get this error:

---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-14-0037ac51261d> in <module>
----> 1 predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, use_compiled_model, wait, model_name, kms_key, data_capture_config, tags, **kwargs)
    801             wait=wait,
    802             kms_key=kms_key,
--> 803             data_capture_config=data_capture_config,
    804         )
    805 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, **kwargs)
    528             kms_key=kms_key,
    529             wait=wait,
--> 530             data_capture_config_dict=data_capture_config_dict,
    531         )
    532 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, kms_key, wait, data_capture_config_dict)
   3191 
   3192             self.sagemaker_client.create_endpoint_config(**config_options)
-> 3193         return self.create_endpoint(endpoint_name=name, config_name=name, tags=tags, wait=wait)
   3194 
   3195     def expand_role(self, role):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, tags, wait)
   2709         )
   2710         if wait:
-> 2711             self.wait_for_endpoint(endpoint_name)
   2712         return endpoint_name
   2713 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in wait_for_endpoint(self, endpoint, poll)
   2978                 ),
   2979                 allowed_statuses=["InService"],
-> 2980                 actual_status=status,
   2981             )
   2982         return desc

UnexpectedStatusException: Error hosting endpoint pytorch-training-2020-11-13-22-51-37-707: Failed. Reason:  The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.

Pulling up the CloudWatch logs, I see:

Traceback (most recent call last):
  File "/opt/conda/bin/torch-model-archiver", line 10, in <module>
    sys.exit(generate_model_archive())
  File "/opt/conda/lib/python3.6/site-packages/model_archiver/model_packaging.py", line 60, in generate_model_archive
    package_model(args, manifest=manifest)
  File "/opt/conda/lib/python3.6/site-packages/model_archiver/model_packaging.py", line 37, in package_model
    model_path = ModelExportUtils.copy_artifacts(model_name, **artifact_files)
  File "/opt/conda/lib/python3.6/site-packages/model_archiver/model_packaging_utils.py", line 150, in copy_artifacts
    shutil.copy(path, model_path)
  File "/opt/conda/lib/python3.6/shutil.py", line 245, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/opt/conda/lib/python3.6/shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/model/model.pth'

Did something change in the PyTorch image?

lawwu · Nov 13 '20

Updating the requirements.txt to

tqdm
requests==2.22.0
regex
sacremoses
sentencepiece==0.1.91
transformers==3.5.0

and using the PyTorch 1.5.0 image worked for me.
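To make the fix above concrete, here is a minimal sketch of how those pins travel with the job: the sagemaker-training toolkit pip-installs source_dir/requirements.txt before invoking the entry point. The directory layout and instance settings below are assumptions, not taken from the notebook:

```python
import os
import tempfile

# Pins from the comment above, written into the source dir that
# SageMaker uploads alongside the entry point.
PINNED_REQUIREMENTS = """\
tqdm
requests==2.22.0
regex
sacremoses
sentencepiece==0.1.91
transformers==3.5.0
"""

def prepare_source_dir(source_dir):
    """Write requirements.txt next to train_deploy.py; the
    sagemaker-training toolkit pip-installs it at container start."""
    os.makedirs(source_dir, exist_ok=True)
    with open(os.path.join(source_dir, "requirements.txt"), "w") as f:
        f.write(PINNED_REQUIREMENTS)
    return source_dir

# A temp dir stands in for the notebook's ./code directory here.
estimator_kwargs = dict(
    entry_point="train_deploy.py",
    source_dir=prepare_source_dir(os.path.join(tempfile.mkdtemp(), "code")),
    framework_version="1.5.0",  # the image version that worked for me
    py_version="py3",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
)

# With the real SDK this becomes:
#   from sagemaker.pytorch import PyTorch
#   estimator = PyTorch(role=role, **estimator_kwargs)
#   estimator.fit({"training": inputs_train, "testing": inputs_test})
```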

lawwu · Nov 14 '20

I am getting the exact same issue. Using PyTorch 1.5.0 does not work for me with the above requirements, nor does PyTorch 1.6.0 with the following requirements:

pycm
tqdm
requests==2.22.0
regex
sacremoses
sentencepiece==0.1.91
# transformers==3.5.0 does not work with pytorch v1.5-1.6
transformers==4.0.0

The sagemaker version in the notebook is as follows:

sagemaker==2.18.0
sagemaker-pyspark==1.4.1

NeuroWinter · Dec 09 '20

The change to switch to TorchServe broke some workflows where the training job doesn't save the model to "/opt/ml/model/model.pth"; see https://github.com/aws/sagemaker-pytorch-inference-toolkit/commit/a3a08d04a0cc4b0cf7074a7839d7e89895cd092d. This change was released in the 1.6 PT container. The solution of providing a custom save_pytorch_model method, mentioned in https://github.com/aws/sagemaker-pytorch-inference-toolkit/issues/86, should work. Another workaround is to switch to 1.5 or older versions of the PyTorch images.
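Put differently: with the 1.6+ containers the training script must itself leave a model.pth at the root of the directory SageMaker exposes as SM_MODEL_DIR (/opt/ml/model inside the container). A minimal sketch of that save step, with pickle standing in for torch.save so the snippet stays dependency-free:

```python
import os
import pickle

def save_model(state_dict, model_dir):
    """Save weights where the PT 1.6+ TorchServe archiver looks for them.

    model_dir is SageMaker's SM_MODEL_DIR (/opt/ml/model inside the
    container); the archiver fails with FileNotFoundError if model.pth
    is missing from its root.
    """
    path = os.path.join(model_dir, "model.pth")
    with open(path, "wb") as f:
        pickle.dump(state_dict, f)  # real scripts: torch.save(model.state_dict(), path)
    return path

# In train_deploy.py this would typically run at the end of main():
#   save_model(model.state_dict(), os.environ["SM_MODEL_DIR"])
```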

icywang86rui · Dec 10 '20

Hi, still facing the same error

himswamy · Jan 16 '22

PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages

Invoking script with the following command:

/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2

Traceback (most recent call last):
  File "train_deploy.py", line 14, in <module>
    from transformers import AdamW, BertForSequenceClassification, BertTokenizer
ModuleNotFoundError: No module named 'transformers'

2022-01-16 17:48:12,592 sagemaker-containers ERROR ExecuteUserScriptError:
Command "/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2"
Traceback (most recent call last):
  File "train_deploy.py", line 14, in <module>
    from transformers import AdamW, BertForSequenceClassification, BertTokenizer
ModuleNotFoundError: No module named 'transformers'

2022-01-16 17:48:21 Uploading - Uploading generated training model
2022-01-16 17:48:21 Failed - Training job failed

Traceback (most recent call last):
  File "train_deploy.py", line 14, in <module>
    from transformers import AdamW, BertForSequenceClassification, BertTokenizer
ModuleNotFoundError: No module named 'transformers'

2022-01-16 17:48:11,890 sagemaker-containers ERROR ExecuteUserScriptError:
Command "/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2"
Traceback (most recent call last):
  File "train_deploy.py", line 14, in <module>
    from transformers import AdamW, BertForSequenceClassification, BertTokenizer
ModuleNotFoundError: No module named 'transformers'

---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
in
     20     disable_profiler=True,  # disable debugger
     21 )
---> 22 estimator.fit({"training": inputs_train, "testing": inputs_test})

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    690         self.jobs.append(self.latest_training_job)
    691         if wait:
--> 692             self.latest_training_job.wait(logs=logs)
    693 
    694     def _compilation_job_name(self):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1653         # If logs are requested, call logs_for_jobs.
   1654         if logs != "None":
-> 1655             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1656         else:
   1657             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3777 
   3778         if wait:
-> 3779             self._check_job_status(job_name, description, "TrainingJobStatus")
   3780         if dot:
   3781             print()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3336                 ),
   3337                 allowed_statuses=["Completed", "Stopped"],
-> 3338                 actual_status=status,
   3339             )
   3340 

UnexpectedStatusException: Error for Training job pytorch-training-2022-01-16-17-43-52-825: Failed. Reason: AlgorithmError: ExecuteUserScriptError: Command "/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2"
Traceback (most recent call last):
  File "train_deploy.py", line 14, in <module>
    from transformers import AdamW, BertForSequenceClassification, BertTokenizer
ModuleNotFoundError: No module named 'transformers'
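The ModuleNotFoundError in this trace usually means requirements.txt never reached the training container; it must sit in the same source_dir as the entry point passed to the estimator. A quick local sanity check (check_source_dir is a hypothetical helper) that can run before fit():

```python
import os

def check_source_dir(source_dir, entry_point="train_deploy.py"):
    """Verify the layout SageMaker needs to install dependencies:
    both the entry point and requirements.txt in one source_dir."""
    problems = []
    if not os.path.isfile(os.path.join(source_dir, entry_point)):
        problems.append("missing entry point: " + entry_point)
    req = os.path.join(source_dir, "requirements.txt")
    if not os.path.isfile(req):
        problems.append("missing requirements.txt (transformers won't be installed)")
    elif "transformers" not in open(req).read():
        problems.append("requirements.txt does not pin transformers")
    return problems  # an empty list means the layout looks right
```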

himswamy · Jan 16 '22

The issue is not yet addressed. I am trying to deploy a custom DETR model with PyTorch (based on the HuggingFace implementation, trained with PyTorch for custom object detection). I am using PT v1.11 and Python 3.8. The weights are structured in the way stated in the AWS PyTorch documentation: https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#for-versions-1-2-and-higher

The deployment is performed on an ml.m4.xlarge, and the error I get is that the weights are not found.

W-9000-model_1.0-stdout MODEL_LOG - FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/model/model.pt'

Could you please clarify where the model weights are expected to be? Or please provide the container Dockerfile so I can check.
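What has worked for me with the 1.2+ containers, following the linked docs, is a model.tar.gz with the weights at the archive root and the inference script under code/. A packaging sketch (the file names model.pt and inference.py reflect my own setup, so treat them as assumptions):

```python
import tarfile

def package_model(weights_path, inference_script, out_path):
    """Build model.tar.gz in the layout the PyTorch containers expect:

        model.pt            <- weights at the archive root
        code/inference.py   <- model_fn / predict_fn overrides
    """
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(weights_path, arcname="model.pt")
        tar.add(inference_script, arcname="code/inference.py")
    return out_path

# Hypothetical usage with files from a finished training run:
#   package_model("model.pt", "inference.py", "model.tar.gz")
```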

georgebakas · Jun 15 '22

Thanks for raising this issue. Can anyone confirm whether this issue still persists on the latest SageMaker?

akrishna1995 · Dec 26 '23