sagemaker-distributed-training-workshop

Lab1 training failed at estimator.fit

Open csxwin opened this issue 11 months ago • 0 comments

I'm running Lab 1 on SageMaker with the following setup:

Image: PyTorch 1.13 Python 3.9 CPU Optimized
Kernel: Python 3.9
Instance: ml.t3.medium

Here's the error message I get when running estimator.fit:

---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
Cell In[17], line 3
      1 # Passing True will halt your kernel, passing False will not. Both create a training job.
      2 # here we are defining the name of the input train channel. you can use whatever name you like! up to 20 channels per job.
----> 3 estimator.fit(wait=True, inputs = {'train':s3_train_path})

File /opt/conda/lib/python3.9/site-packages/sagemaker/workflow/pipeline_context.py:346, in runnable_by_pipeline.<locals>.wrapper(*args, **kwargs)
    342         return context
    344     return _StepArguments(retrieve_caller_name(self_instance), run_func, *args, **kwargs)
--> 346 return run_func(*args, **kwargs)

File /opt/conda/lib/python3.9/site-packages/sagemaker/estimator.py:1341, in EstimatorBase.fit(self, inputs, wait, logs, job_name, experiment_config)
   1339 self.jobs.append(self.latest_training_job)
   1340 if wait:
-> 1341     self.latest_training_job.wait(logs=logs)

File /opt/conda/lib/python3.9/site-packages/sagemaker/estimator.py:2680, in _TrainingJob.wait(self, logs)
   2678 # If logs are requested, call logs_for_jobs.
   2679 if logs != "None":
-> 2680     self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   2681 else:
   2682     self.sagemaker_session.wait_for_job(self.job_name)

File /opt/conda/lib/python3.9/site-packages/sagemaker/session.py:5766, in Session.logs_for_job(self, job_name, wait, poll, log_type, timeout)
   5745 def logs_for_job(self, job_name, wait=False, poll=10, log_type="All", timeout=None):
   5746     """Display logs for a given training job, optionally tailing them until job is complete.
   5747 
   5748     If the output is a tty or a Jupyter cell, it will be color-coded
   (...)
   5764         exceptions.UnexpectedStatusException: If waiting and the training job fails.
   5765     """
-> 5766     _logs_for_job(self, job_name, wait, poll, log_type, timeout)

File /opt/conda/lib/python3.9/site-packages/sagemaker/session.py:7995, in _logs_for_job(sagemaker_session, job_name, wait, poll, log_type, timeout)
   7992             last_profiler_rule_statuses = profiler_rule_statuses
   7994 if wait:
-> 7995     _check_job_status(job_name, description, "TrainingJobStatus")
   7996     if dot:
   7997         print()

File /opt/conda/lib/python3.9/site-packages/sagemaker/session.py:8048, in _check_job_status(job, desc, status_key_name)
   8042 if "CapacityError" in str(reason):
   8043     raise exceptions.CapacityError(
   8044         message=message,
   8045         allowed_statuses=["Completed", "Stopped"],
   8046         actual_status=status,
   8047     )
-> 8048 raise exceptions.UnexpectedStatusException(
   8049     message=message,
   8050     allowed_statuses=["Completed", "Stopped"],
   8051     actual_status=status,
   8052 )

UnexpectedStatusException: Error for Training job shuxucao-ddp-mnist-2024-03-19-03-40-53-406: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "TypeError: Descriptors cannot be created directly.
 If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
 If you cannot immediately regenerate your protos, some other possible workarounds are
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
 
 More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
 File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
 File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
 File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
 # may not use this file except in compliance with the License. A copy of
 File "<frozen importlib._bootstrap>", line 991, in _find_and_load
 File "<frozen zipimport>", line 259, in load_module
 File 

The protobuf package installed via pip is version 3.20.2. Should I run this lab with Python 3.8 instead?
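
For anyone trying to reproduce this, here is a minimal sketch of how the two workarounds named in the error message could be checked from Python. It rests on assumptions: the training job runs in its own container, so the protobuf version there may differ from the 3.20.2 reported by pip in the notebook kernel, and the check would need to run inside the training script to be meaningful. The helper names (`parse_version`, `satisfies_workaround_1`, `installed_protobuf_version`, `apply_workaround_2`) are hypothetical and not part of the workshop code.

```python
import os
import re
from importlib import metadata


def parse_version(version: str) -> tuple:
    """Return (major, minor) from a version string, using leading digits only."""
    parts = []
    for piece in version.split(".")[:2]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 2:
        parts.append(0)
    return tuple(parts)


def satisfies_workaround_1(version: str) -> bool:
    """True if the version is 3.20.x or lower (workaround 1 in the error message)."""
    return parse_version(version) <= (3, 20)


def installed_protobuf_version():
    """Return the protobuf version in the current environment, or None if absent."""
    try:
        return metadata.version("protobuf")
    except metadata.PackageNotFoundError:
        return None


def apply_workaround_2() -> None:
    """Select the pure-Python protobuf implementation (workaround 2, slower).

    This only helps if it runs before anything imports google.protobuf.
    An already-set value is left untouched.
    """
    os.environ.setdefault("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION", "python")


if __name__ == "__main__":
    found = installed_protobuf_version()
    print("protobuf version in this environment:", found)
    if found is not None and not satisfies_workaround_1(found):
        apply_workaround_2()
```

Placed at the very top of the training entry point, this would print the version the container actually sees and fall back to the pure-Python parser only when that version is newer than 3.20.x. Pinning protobuf in the training environment would be the alternative, matching workaround 1.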

csxwin, Mar 19 '24 03:03