sagemaker-distributed-training-workshop
Lab1 training failed at estimator.fit
I'm running lab1 on SageMaker. Image: PyTorch 1.13 Python 3.9 CPU Optimized. Kernel: Python 3.9. Instance: ml.t3.medium.
Here's the error message I get when running estimator.fit:
---------------------------------------------------------------------------
UnexpectedStatusException Traceback (most recent call last)
Cell In[17], line 3
1 # Passing True will halt your kernel, passing False will not. Both create a training job.
2 # here we are defining the name of the input train channel. you can use whatever name you like! up to 20 channels per job.
----> 3 estimator.fit(wait=True, inputs = {'train':s3_train_path})
File /opt/conda/lib/python3.9/site-packages/sagemaker/workflow/pipeline_context.py:346, in runnable_by_pipeline.<locals>.wrapper(*args, **kwargs)
342 return context
344 return _StepArguments(retrieve_caller_name(self_instance), run_func, *args, **kwargs)
--> 346 return run_func(*args, **kwargs)
File /opt/conda/lib/python3.9/site-packages/sagemaker/estimator.py:1341, in EstimatorBase.fit(self, inputs, wait, logs, job_name, experiment_config)
1339 self.jobs.append(self.latest_training_job)
1340 if wait:
-> 1341 self.latest_training_job.wait(logs=logs)
File /opt/conda/lib/python3.9/site-packages/sagemaker/estimator.py:2680, in _TrainingJob.wait(self, logs)
2678 # If logs are requested, call logs_for_jobs.
2679 if logs != "None":
-> 2680 self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
2681 else:
2682 self.sagemaker_session.wait_for_job(self.job_name)
File /opt/conda/lib/python3.9/site-packages/sagemaker/session.py:5766, in Session.logs_for_job(self, job_name, wait, poll, log_type, timeout)
5745 def logs_for_job(self, job_name, wait=False, poll=10, log_type="All", timeout=None):
5746 """Display logs for a given training job, optionally tailing them until job is complete.
5747
5748 If the output is a tty or a Jupyter cell, it will be color-coded
(...)
5764 exceptions.UnexpectedStatusException: If waiting and the training job fails.
5765 """
-> 5766 _logs_for_job(self, job_name, wait, poll, log_type, timeout)
File /opt/conda/lib/python3.9/site-packages/sagemaker/session.py:7995, in _logs_for_job(sagemaker_session, job_name, wait, poll, log_type, timeout)
7992 last_profiler_rule_statuses = profiler_rule_statuses
7994 if wait:
-> 7995 _check_job_status(job_name, description, "TrainingJobStatus")
7996 if dot:
7997 print()
File /opt/conda/lib/python3.9/site-packages/sagemaker/session.py:8048, in _check_job_status(job, desc, status_key_name)
8042 if "CapacityError" in str(reason):
8043 raise exceptions.CapacityError(
8044 message=message,
8045 allowed_statuses=["Completed", "Stopped"],
8046 actual_status=status,
8047 )
-> 8048 raise exceptions.UnexpectedStatusException(
8049 message=message,
8050 allowed_statuses=["Completed", "Stopped"],
8051 actual_status=status,
8052 )
UnexpectedStatusException: Error for Training job shuxucao-ddp-mnist-2024-03-19-03-40-53-406: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "TypeError: Descriptors cannot be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are
1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
# may not use this file except in compliance with the License. A copy of
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen zipimport>", line 259, in load_module
File
The protobuf package installed via pip is 3.20.2. Should I run this lab with Python 3.8 instead?
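For anyone hitting the same error, here is a minimal sketch for checking whether a given Python environment satisfies workaround 1 from the error message (protobuf 3.20.x or lower). The helper names `parse_major_minor`, `satisfies_workaround_1`, and `installed_protobuf_version` are hypothetical, made up for illustration. Note that a training job runs in its own container, so the version reported in the notebook kernel is not necessarily the version the training script imports; the check would need to run inside the training script to be conclusive.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '3.20.2'."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def satisfies_workaround_1(version: str) -> bool:
    """True if the version is 3.20.x or lower, as workaround 1 suggests."""
    return parse_major_minor(version) <= (3, 20)


def installed_protobuf_version() -> Optional[str]:
    """Return the protobuf version visible to this interpreter, or None."""
    try:
        return metadata.version("protobuf")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_protobuf_version()
    if found is None:
        print("protobuf is not installed in this environment")
    else:
        status = "3.20.x or lower" if satisfies_workaround_1(found) else "newer than 3.20.x"
        print(f"protobuf {found} ({status})")
```

Running this in the notebook only describes the notebook kernel. Placing the same check at the top of the training entry point (an assumption about how the lab's script is organized) would show which protobuf version the job actually sees.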