amazon-sagemaker-examples

[Bug Report] Jumpstart LLaMA 2 finetuning

Open timelfrink opened this issue 10 months ago • 3 comments

Link to the notebook: Fine-tune LLaMA 2 models on SageMaker JumpStart @vivekmadan2

Describe the bug: When running the training job, I get this error: ClientError: Data download failed:Unable to download object s3://sagemaker-repository-dub/model-data-model-package_llama2-7b-v3-740347e540da35b4ab9f6fc0ab3fed2c (AccessDenied: Access Denied). The role I assume has a policy that grants read access to that S3 bucket, but the bucket itself does not allow the access, and the bucket is not managed by me.
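For anyone trying to narrow this down, here is a minimal sketch of how the failing object could be probed from the notebook. It pulls the bucket and key out of the failure message and, optionally, issues a HEAD request with boto3. The helper names `extract_s3_object` and `probe_object` are hypothetical, and the probe runs with whatever credentials the notebook has, which may differ from the role the training job actually uses.

```python
import re
from typing import Optional, Tuple

# Matches the first s3://bucket/key URI embedded in a failure message.
_S3_URI = re.compile(r"s3://([^/\s]+)/(\S+)")


def extract_s3_object(failure_reason: str) -> Optional[Tuple[str, str]]:
    """Return (bucket, key) for the first S3 URI in the message, or None."""
    match = _S3_URI.search(failure_reason)
    if match is None:
        return None
    return match.group(1), match.group(2)


def probe_object(bucket: str, key: str) -> str:
    """Hypothetical helper: HEAD the object with the current credentials.

    Returns "OK" on success, otherwise the error code reported by S3
    (for HEAD requests this is typically an HTTP status such as "403").
    """
    # Imported lazily so the parsing helper works without boto3 installed.
    import boto3
    from botocore.exceptions import ClientError

    try:
        boto3.client("s3").head_object(Bucket=bucket, Key=key)
        return "OK"
    except ClientError as err:
        return err.response["Error"]["Code"]


# Failure reason copied from the training job error in this report.
reason = (
    "ClientError: Data download failed:Unable to download object "
    "s3://sagemaker-repository-dub/"
    "model-data-model-package_llama2-7b-v3-740347e540da35b4ab9f6fc0ab3fed2c "
    "(AccessDenied: Access Denied)"
)
bucket, key = extract_s3_object(reason)
print(bucket)
print(key)
# To probe access from the notebook (requires AWS credentials):
# print(probe_object(bucket, key))
```

If the path in the message is a prefix rather than a single object, the HEAD request can fail for that reason alone, so a non-"OK" result here is only a hint and not proof of a bucket policy problem.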

To reproduce: Run the notebook in eu-west-1 in SageMaker Studio with the Python 3 kernel.

Logs

INFO:sagemaker:Creating training-job with name: meta-textgeneration-llama-2-7b-2023-08-15-19-52-03-298

2023-08-15 19:52:03 Starting - Starting the training job...
2023-08-15 19:52:31 Starting - Preparing the instances for training......
2023-08-15 19:53:34 Downloading - Downloading input data.........
2023-08-15 19:54:45 Failed - Training job failed
..

---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-4-c71e7ff00495> in <module>
     12 # By default, instruction tuning is set to false. Thus, to use instruction tuning dataset you use
     13 estimator.set_hyperparameters(instruction_tuned="True", epoch="5")
---> 14 estimator.fit({"training": train_data_location})

/opt/conda/lib/python3.7/site-packages/sagemaker/jumpstart/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    652         )
    653 
--> 654         return super(JumpStartEstimator, self).fit(**estimator_fit_kwargs.to_kwargs_dict())
    655 
    656     def deploy(

/opt/conda/lib/python3.7/site-packages/sagemaker/workflow/pipeline_context.py in wrapper(*args, **kwargs)
    309             return _StepArguments(retrieve_caller_name(self_instance), run_func, *args, **kwargs)
    310 
--> 311         return run_func(*args, **kwargs)
    312 
    313     return wrapper

/opt/conda/lib/python3.7/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
   1290         self.jobs.append(self.latest_training_job)
   1291         if wait:
-> 1292             self.latest_training_job.wait(logs=logs)
   1293 
   1294     def _compilation_job_name(self):

/opt/conda/lib/python3.7/site-packages/sagemaker/estimator.py in wait(self, logs)
   2472         # If logs are requested, call logs_for_jobs.
   2473         if logs != "None":
-> 2474             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   2475         else:
   2476             self.sagemaker_session.wait_for_job(self.job_name)

/opt/conda/lib/python3.7/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type, timeout)
   4847             exceptions.UnexpectedStatusException: If waiting and the training job fails.
   4848         """
-> 4849         _logs_for_job(self.boto_session, job_name, wait, poll, log_type, timeout)
   4850 
   4851     def logs_for_processing_job(self, job_name, wait=False, poll=10):

/opt/conda/lib/python3.7/site-packages/sagemaker/session.py in _logs_for_job(boto_session, job_name, wait, poll, log_type, timeout)
   6758 
   6759     if wait:
-> 6760         _check_job_status(job_name, description, "TrainingJobStatus")
   6761         if dot:
   6762             print()

/opt/conda/lib/python3.7/site-packages/sagemaker/session.py in _check_job_status(job, desc, status_key_name)
   6814             message=message,
   6815             allowed_statuses=["Completed", "Stopped"],
-> 6816             actual_status=status,
   6817         )
   6818 

UnexpectedStatusException: Error for Training job meta-textgeneration-llama-2-7b-2023-08-15-19-52-03-298: Failed. Reason: ClientError: Data download failed:Unable to download object s3://sagemaker-repository-dub/model-data-model-package_llama2-7b-v3-740347e540da35b4ab9f6fc0ab3fed2c (AccessDenied: Access Denied)


timelfrink • Aug 15 '23 20:08