sagemaker-run-notebook
Processing Job not showing failure when notebook fails
Hi, we discovered that even when a notebook raises an error, the processing job ends in status "Completed". What we need is for the processing job to show a failed state when the notebook raises an exception. I had a look at execute.py and found this section: https://github.com/aws-samples/sagemaker-run-notebook/blob/master/sagemaker_run_notebook/container/execute.py#L92
except Exception as e:
    # Write out an error file. This will be returned as the failureReason in the
    # DescribeProcessingJob result.
    trc = traceback.format_exc()
    # with open(os.path.join(output_path, 'failure'), 'w') as s:
    #     s.write('Exception during processing: ' + str(e) + '\n' + trc)
    # Printing this causes the exception to be in the training job logs, as well.
    print("Exception during processing: " + str(e) + "\n" + trc, file=sys.stderr)
    # A non-zero exit code causes the training job to be marked as Failed.
    # sys.exit(255)
    output_notebook = "xyzzy"  # Dummy for print, below
I tried changing it to the following and rebuilt the Docker image, but the processing job still shows "Completed" instead of "Failed" when the notebook fails.
except Exception as e:
    # Write out an error file. This will be returned as the failureReason in the
    # DescribeTrainingJob result.
    trc = traceback.format_exc()
    with open(os.path.join(os.path.dirname(output_notebook), 'failure'), 'w') as s:
        s.write('Exception during execution: ' + str(e) + '\n' + trc)
    # Printing this causes the exception to be in the training job logs, as well.
    print('Exception during execution: ' + str(e) + '\n' + trc, file=sys.stderr)
    # A non-zero exit code causes the training job to be marked as Failed.
    sys.exit(255)
I'm pretty sure that's got to do with how AWS SageMaker determines what counts as a 'failed' processing job. I think it only marks a job as failed for specific kernel-level issues.
I'd recommend using the 'notebook output' part of the extension; it gives you the papermill output of the Jupyter notebook, which contains all the low-level logging information :)
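For what it's worth, you can see exactly what SageMaker itself reports for a run by calling DescribeProcessingJob. A minimal boto3 sketch (the job name is just a placeholder, and credentials/region are assumed to be configured):

import boto3

sm = boto3.client("sagemaker")
desc = sm.describe_processing_job(ProcessingJobName="run-notebook-example-job")

print(desc["ProcessingJobStatus"])   # "Completed", "Failed", "Stopped", ...
print(desc.get("ExitMessage"))       # optional message from the container, if any
print(desc.get("FailureReason"))     # only populated when the job is marked Failed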
The thing is, I am currently setting up a state machine with Step Functions, and some of the steps use this approach to run notebooks. The problem is that if one of those notebooks fails, Step Functions sees the success from the processing job and continues.
There is probably a better way to run our data scientists' code as part of a step function, but this was a quick and easy way to get it working.
I am working on an alternative, but I still think that if there is an error in the notebook, the processing job should be marked as failed.
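As a stopgap, one option would be an extra step that inspects the papermill output notebook and fails explicitly if any cell produced an error output. Rough sketch only; the bucket and key are placeholders for wherever the output notebook lands in S3:

import json
import boto3

s3 = boto3.client("s3")

def notebook_failed(bucket, key):
    # Load the output notebook and look for cells whose outputs contain an error.
    nb = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
    return any(
        output.get("output_type") == "error"
        for cell in nb.get("cells", [])
        for output in cell.get("outputs", [])
    )

if notebook_failed("my-bucket", "notebook-runs/output.ipynb"):
    raise RuntimeError("Notebook run produced an error")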
OK, I found a solution for this. In addition to the change I mentioned above, I also changed this part of execute.py from:
papermill.execute_notebook(
    notebook_file,
    output_notebook,
    params,
    **arg_map,
)
to:
papermill.execute_notebook(
    input_path=notebook_file,
    output_path=output_notebook,
    parameters=params,
    log_output=True,
    autosave_cell_every=60,
    stdout_file=sys.stdout,
    stderr_file=sys.stderr,
    **arg_map,
)
and run_notebook from:
python -u /opt/program/execute.py 2>&1 | stdbuf -o0 tr '\r' '\n'
to:
python -u /opt/program/execute.py
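With that last change, the exit code from execute.py becomes the container's exit code, so the sys.exit(255) actually marks the processing job as failed. With the original pipe, the shell's exit status was presumably that of the last command in the pipeline (tr), which is 0 even when execute.py exits non-zero, so SageMaker never saw the failure.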