amazon-sagemaker-examples
scikit_bring_your_own: Failed Reason: AlgorithmError: Exit Code: 1
Hello,
I'm trying to create a custom model similar to the "scikit_bring_your_own" example from the AWS SageMaker examples.
Here's my code:
data_location = sess.upload_data(outdir + 'train.tsv', bucket, prefix + '/training')
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/sagemaker-onevsrest:latest'.format(account, region)

%%time
model = sage.estimator.Estimator(image, role, 1, 'ml.m4.2xlarge',
                                 output_path="s3://{}/output".format(sess.default_bucket()),
                                 sagemaker_session=sess)
model.fit(data_location)
As soon as the cell above starts executing, I get the following error:
2019-05-10 10:27:41,022 : INFO : Creating training-job with name: sagemaker-onevsrest-2019-05-10-10-27-41-022
2019-05-10 10:27:41 Starting - Starting the training job...
2019-05-10 10:27:43 Starting - Launching requested ML instances......
2019-05-10 10:28:54 Starting - Preparing the instances for training...
2019-05-10 10:29:39 Downloading - Downloading input data...
2019-05-10 10:30:03 Training - Training image download completed. Training in progress.
2019-05-10 10:30:03 Uploading - Uploading generated training model
2019-05-10 10:30:03 Failed - Training job failed
exec: "train": executable file not found in $PATH
ValueError Traceback (most recent call last)
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name)
    234         self.latest_training_job = _TrainingJob.start_new(self, inputs)
    235         if wait:
--> 236             self.latest_training_job.wait(logs=logs)
    237
    238     def _compilation_job_name(self):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
    591     def wait(self, logs=True):
    592         if logs:
--> 593             self.sagemaker_session.logs_for_job(self.job_name, wait=True)
    594         else:
    595             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll)
   1219
   1220         if wait:
-> 1221             self._check_job_status(job_name, description, 'TrainingJobStatus')
   1222             if dot:
   1223                 print()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
    915             reason = desc.get('FailureReason', '(No reason provided)')
    916             job_type = status_key_name.replace('JobStatus', ' job')
--> 917             raise ValueError('Error for {} {}: {} Reason: {}'.format(job_type, job, status, reason))
    918
    919     def wait_for_endpoint(self, endpoint, poll=5):
ValueError: Error for Training job sagemaker-onevsrest-2019-05-10-10-27-41-022: Failed Reason: AlgorithmError: Exit Code: 1
Could anyone help me resolve the above error? Thanks in advance!
Hi @kvr2000,
Thank you for trying out Amazon SageMaker.
Without knowing what is inside your container, it will be difficult to troubleshoot based on your error message alone.
Instead, I'd recommend using the latest SageMaker scikit-learn experience if that fits your use case: https://github.com/aws/sagemaker-scikit-learn-container. For example, see this notebook example.
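One thing that may be worth checking before rebuilding the image: the log line `exec: "train": executable file not found in $PATH` suggests the container has no executable named `train` on its `PATH`. SageMaker starts a training container by running it with the argument `train`, so a bring-your-own image needs such a file, marked executable, in a directory on `PATH`. The snippet below is a minimal local sanity check, not part of the SageMaker SDK. The function name `check_entrypoint` and the idea of pointing it at the local folder that gets copied into the image are assumptions made for illustration.

```python
import os
import stat


def check_entrypoint(program_dir, name="train"):
    """Return a list of problems with the `name` script in program_dir.

    Hypothetical helper: program_dir is assumed to be the local folder
    that the Dockerfile copies into a directory on the container's PATH.
    """
    path = os.path.join(program_dir, name)
    if not os.path.isfile(path):
        return ["{} does not exist".format(path)]

    problems = []

    # The file needs at least one execute bit, otherwise the container
    # runtime cannot exec it even when it is on PATH.
    mode = os.stat(path).st_mode
    if not mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
        problems.append("{} is not executable (try chmod +x)".format(path))

    # A script without a shebang line cannot be executed directly.
    with open(path, "rb") as f:
        first_line = f.readline()
    if not first_line.startswith(b"#!"):
        problems.append("{} has no shebang line".format(path))

    return problems
```

An empty list from `check_entrypoint("path/to/your/program_dir")` only means the local file looks fine. It is still worth confirming that the Dockerfile copies that folder into the image and adds the destination to `PATH`.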
I just ran into the same problem as @kunalranawat. @asadoughi The link to the notebook example above is broken. I will try the other link.
@asadoughi The link is to a project that's much more complicated. Sorting through it to find out how to deal with the S3 problem won't be easy.