[Bug Report] Jumpstart LLaMA 2 finetuning
Link to the notebook
Fine-tune LLaMA 2 models on SageMaker JumpStart @vivekmadan2
Describe the bug
When running the training job I get this error:
ClientError: Data download failed:Unable to download object s3://sagemaker-repository-dub/model-data-model-package_llama2-7b-v3-740347e540da35b4ab9f6fc0ab3fed2c (AccessDenied: Access Denied)
The role I assume has a policy that grants read access to that S3 bucket, but the bucket itself doesn't allow the access, and it is not managed by me.
To reproduce
Run the notebook in eu-west-1 in SageMaker Studio with the Python 3 kernel.
Logs
INFO:sagemaker:Creating training-job with name: meta-textgeneration-llama-2-7b-2023-08-15-19-52-03-298
2023-08-15 19:52:03 Starting - Starting the training job...
2023-08-15 19:52:31 Starting - Preparing the instances for training......
2023-08-15 19:53:34 Downloading - Downloading input data.........
2023-08-15 19:54:45 Failed - Training job failed
..
---------------------------------------------------------------------------
UnexpectedStatusException Traceback (most recent call last)
<ipython-input-4-c71e7ff00495> in <module>
12 # By default, instruction tuning is set to false. Thus, to use instruction tuning dataset you use
13 estimator.set_hyperparameters(instruction_tuned="True", epoch="5")
---> 14 estimator.fit({"training": train_data_location})
/opt/conda/lib/python3.7/site-packages/sagemaker/jumpstart/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
652 )
653
--> 654 return super(JumpStartEstimator, self).fit(**estimator_fit_kwargs.to_kwargs_dict())
655
656 def deploy(
/opt/conda/lib/python3.7/site-packages/sagemaker/workflow/pipeline_context.py in wrapper(*args, **kwargs)
309 return _StepArguments(retrieve_caller_name(self_instance), run_func, *args, **kwargs)
310
--> 311 return run_func(*args, **kwargs)
312
313 return wrapper
/opt/conda/lib/python3.7/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
1290 self.jobs.append(self.latest_training_job)
1291 if wait:
-> 1292 self.latest_training_job.wait(logs=logs)
1293
1294 def _compilation_job_name(self):
/opt/conda/lib/python3.7/site-packages/sagemaker/estimator.py in wait(self, logs)
2472 # If logs are requested, call logs_for_jobs.
2473 if logs != "None":
-> 2474 self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
2475 else:
2476 self.sagemaker_session.wait_for_job(self.job_name)
/opt/conda/lib/python3.7/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type, timeout)
4847 exceptions.UnexpectedStatusException: If waiting and the training job fails.
4848 """
-> 4849 _logs_for_job(self.boto_session, job_name, wait, poll, log_type, timeout)
4850
4851 def logs_for_processing_job(self, job_name, wait=False, poll=10):
/opt/conda/lib/python3.7/site-packages/sagemaker/session.py in _logs_for_job(boto_session, job_name, wait, poll, log_type, timeout)
6758
6759 if wait:
-> 6760 _check_job_status(job_name, description, "TrainingJobStatus")
6761 if dot:
6762 print()
/opt/conda/lib/python3.7/site-packages/sagemaker/session.py in _check_job_status(job, desc, status_key_name)
6814 message=message,
6815 allowed_statuses=["Completed", "Stopped"],
-> 6816 actual_status=status,
6817 )
6818
UnexpectedStatusException: Error for Training job meta-textgeneration-llama-2-7b-2023-08-15-19-52-03-298: Failed. Reason: ClientError: Data download failed:Unable to download object s3://sagemaker-repository-dub/model-data-model-package_llama2-7b-v3-740347e540da35b4ab9f6fc0ab3fed2c (AccessDenied: Access Denied)
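For anyone debugging the same failure, it can help to pull the bucket and key out of the error text so the object can be checked directly with the training role. The following is only a minimal sketch: `extract_s3_location` and the regular expression are hypothetical helpers written for this thread, not part of the SageMaker SDK, and the pattern assumes the URI contains no whitespace.

```python
import re
from typing import Optional, Tuple

# Assumption: the S3 URI in the message has no spaces, so the key ends at the
# first whitespace character (here, just before "(AccessDenied: ...)").
S3_URI_PATTERN = re.compile(r"s3://([^/\s]+)/(\S+)")


def extract_s3_location(error_message: str) -> Optional[Tuple[str, str]]:
    """Return (bucket, key) for the first S3 URI in the message, or None."""
    match = S3_URI_PATTERN.search(error_message)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Error text copied from the failed training job above.
message = (
    "ClientError: Data download failed:Unable to download object "
    "s3://sagemaker-repository-dub/model-data-model-package_llama2-7b-v3-"
    "740347e540da35b4ab9f6fc0ab3fed2c (AccessDenied: Access Denied)"
)

location = extract_s3_location(message)
if location is not None:
    bucket, key = location
    print("bucket:", bucket)
    print("key:", key)
```

With the bucket and key in hand, a call such as boto3's `head_object(Bucket=bucket, Key=key)` made under the same execution role should show whether the denial can be reproduced outside the training job.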
I'm having a similar error, but in a different bucket: "An error occurred (ValidationException) when calling the CreateTrainingJob operation: No S3 objects found under S3 URL "s3://jumpstart-cache-prod-us-east-1/source-directory-tarballs/meta/transfer_learning/textgeneration/v1.0.1/sourcedir.tar.gz" given in input data source. Please ensure that the bucket exists in the selected region (us-east-1), that objects exist under that S3 prefix, and that the role "arn:aws:iam::XXXXXXXX:role/service-role/XXXXXXXXXXXXXX" has "s3:ListBucket" permissions on bucket "jumpstart-cache-prod-us-east-1". Error message from S3: Access Denied"
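The message above names `s3:ListBucket` explicitly. As a rough sketch, an identity-based policy granting list and read access on that bucket could be assembled as below. `build_read_policy` is a hypothetical helper, and this is not a confirmed fix: a policy attached to the role cannot override a denial coming from the bucket's own policy, and this bucket is not managed by the user.

```python
import json


def build_read_policy(bucket: str) -> dict:
    """Build an IAM policy document allowing list and read access on one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket applies to the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # GetObject applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


policy = build_read_policy("jumpstart-cache-prod-us-east-1")
print(json.dumps(policy, indent=2))
```

Comparing this output with the policy actually attached to the execution role is a quick way to confirm whether the bucket-level `s3:ListBucket` statement is missing.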
I'm getting a similar error. Were you able to resolve it?
I'm also facing a similar error while fine-tuning Meta Llama 2 on SageMaker JumpStart: "Error for Training job meta-textgeneration-llama-2-7b-2024-06-17-01-56-32-919: Failed. Reason: ClientError: Data download failed:Failed to download data. ListObjectsV2 failed for s3://genai withaws project 2024/training-datasets/finance, nextToken:[null]: Unable to execute request to S3"
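One thing worth checking in this last message is the bucket name: as printed, it contains spaces, which S3 bucket names do not allow. The sketch below tests a name against a simplified subset of the general-purpose bucket naming rules (3 to 63 characters; lowercase letters, digits, dots and hyphens; starting and ending with a letter or digit; no consecutive dots). `bucket_from_uri` and `is_valid_bucket_name` are hypothetical helpers, and the spaces may simply come from how the message was pasted, so treat the result as a hint rather than a diagnosis.

```python
import re

# Simplified subset of the S3 general-purpose bucket naming rules.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def bucket_from_uri(uri: str) -> str:
    """Return the bucket portion of an s3:// URI (everything up to the first slash)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    return uri[len(prefix):].split("/", 1)[0]


def is_valid_bucket_name(name: str) -> bool:
    """Check a name against the simplified rules described above."""
    return _BUCKET_NAME.fullmatch(name) is not None and ".." not in name


# URI copied from the error message above.
uri = "s3://genai withaws project 2024/training-datasets/finance"
bucket = bucket_from_uri(uri)
print(repr(bucket), "valid:", is_valid_bucket_name(bucket))
```

If the check fails for the real bucket name as well, the `train_data_location` passed to `estimator.fit` is the first place to look.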