amazon-sagemaker-examples
[Bug Report] Jumpstart LLaMA 2 finetuning
Link to the notebook: Fine-tune LLaMA 2 models on SageMaker JumpStart @vivekmadan2
Describe the bug
When running the training job, I get this error:
ClientError: Data download failed:Unable to download object s3://sagemaker-repository-dub/model-data-model-package_llama2-7b-v3-740347e540da35b4ab9f6fc0ab3fed2c (AccessDenied: Access Denied)
The role I assume has a policy that grants read access to that S3 bucket. However, the bucket itself does not allow the access, and it is not managed by me.
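To check the access outside of a training job, something like the following sketch could be run from the same Studio notebook. It assumes the notebook runs under the same role the training job uses, and that the path in the error is a single object rather than a prefix. `split_s3_uri` and `check_object_access` are hypothetical helper names, not part of the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into a (bucket, key) tuple."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def check_object_access(uri):
    """Return 'OK', or the S3 error code from a HEAD request on the object.

    Assumption: the current credentials belong to the role under test.
    """
    # Imported lazily so split_s3_uri can be used without boto3 installed.
    import boto3
    from botocore.exceptions import ClientError

    bucket, key = split_s3_uri(uri)
    try:
        boto3.client("s3").head_object(Bucket=bucket, Key=key)
        return "OK"
    except ClientError as err:
        return err.response["Error"]["Code"]


# Example call (makes a real S3 request, so it is left commented out):
# check_object_access(
#     "s3://sagemaker-repository-dub/"
#     "model-data-model-package_llama2-7b-v3-740347e540da35b4ab9f6fc0ab3fed2c"
# )
```

If this also reports an access error, the denial is probably not specific to the training job. Because the bucket belongs to another account, both my role's policy and the bucket's own policy would have to allow the read.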
To reproduce
Running in eu-west-1 in SageMaker Studio with a Python 3 kernel.
Logs
INFO:sagemaker:Creating training-job with name: meta-textgeneration-llama-2-7b-2023-08-15-19-52-03-298
2023-08-15 19:52:03 Starting - Starting the training job...
2023-08-15 19:52:31 Starting - Preparing the instances for training......
2023-08-15 19:53:34 Downloading - Downloading input data.........
2023-08-15 19:54:45 Failed - Training job failed
..
---------------------------------------------------------------------------
UnexpectedStatusException Traceback (most recent call last)
<ipython-input-4-c71e7ff00495> in <module>
12 # By default, instruction tuning is set to false. Thus, to use instruction tuning dataset you use
13 estimator.set_hyperparameters(instruction_tuned="True", epoch="5")
---> 14 estimator.fit({"training": train_data_location})
/opt/conda/lib/python3.7/site-packages/sagemaker/jumpstart/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
652 )
653
--> 654 return super(JumpStartEstimator, self).fit(**estimator_fit_kwargs.to_kwargs_dict())
655
656 def deploy(
/opt/conda/lib/python3.7/site-packages/sagemaker/workflow/pipeline_context.py in wrapper(*args, **kwargs)
309 return _StepArguments(retrieve_caller_name(self_instance), run_func, *args, **kwargs)
310
--> 311 return run_func(*args, **kwargs)
312
313 return wrapper
/opt/conda/lib/python3.7/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
1290 self.jobs.append(self.latest_training_job)
1291 if wait:
-> 1292 self.latest_training_job.wait(logs=logs)
1293
1294 def _compilation_job_name(self):
/opt/conda/lib/python3.7/site-packages/sagemaker/estimator.py in wait(self, logs)
2472 # If logs are requested, call logs_for_jobs.
2473 if logs != "None":
-> 2474 self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
2475 else:
2476 self.sagemaker_session.wait_for_job(self.job_name)
/opt/conda/lib/python3.7/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type, timeout)
4847 exceptions.UnexpectedStatusException: If waiting and the training job fails.
4848 """
-> 4849 _logs_for_job(self.boto_session, job_name, wait, poll, log_type, timeout)
4850
4851 def logs_for_processing_job(self, job_name, wait=False, poll=10):
/opt/conda/lib/python3.7/site-packages/sagemaker/session.py in _logs_for_job(boto_session, job_name, wait, poll, log_type, timeout)
6758
6759 if wait:
-> 6760 _check_job_status(job_name, description, "TrainingJobStatus")
6761 if dot:
6762 print()
/opt/conda/lib/python3.7/site-packages/sagemaker/session.py in _check_job_status(job, desc, status_key_name)
6814 message=message,
6815 allowed_statuses=["Completed", "Stopped"],
-> 6816 actual_status=status,
6817 )
6818
UnexpectedStatusException: Error for Training job meta-textgeneration-llama-2-7b-2023-08-15-19-52-03-298: Failed. Reason: ClientError: Data download failed:Unable to download object s3://sagemaker-repository-dub/model-data-model-package_llama2-7b-v3-740347e540da35b4ab9f6fc0ab3fed2c (AccessDenied: Access Denied)
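For anyone triaging this, the failure reason can also be read directly from the training job description without going through the full traceback. This is a minimal sketch: `failure_reason` is a hypothetical helper, and `sm_client` is assumed to be a boto3 SageMaker client created for the region the job ran in.

```python
def failure_reason(sm_client, job_name):
    """Return (status, failure reason) for a SageMaker training job.

    sm_client is assumed to be boto3.client("sagemaker", region_name=...).
    The reason is None when the job description has no FailureReason.
    """
    desc = sm_client.describe_training_job(TrainingJobName=job_name)
    return desc["TrainingJobStatus"], desc.get("FailureReason")


# Example call (makes a real SageMaker request, so it is left commented out):
# import boto3
# failure_reason(
#     boto3.client("sagemaker", region_name="eu-west-1"),
#     "meta-textgeneration-llama-2-7b-2023-08-15-19-52-03-298",
# )
```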