Add fail_on_nonzero_exit parameter to SSM operators for exit code routing
Problem
SSM operators currently fail when commands return non-zero exit codes, making it impossible to:
- Route workflows based on different exit codes
- Handle commands where non-zero exit codes represent valid business states (e.g., partial success, warnings)
- Implement conditional retry logic based on specific exit codes
- Migrate from traditional schedulers like Autosys that support exit code routing
Users have been forced to implement manual polling workarounds with custom Python tasks to handle these scenarios.
Proposal
Add a fail_on_nonzero_exit parameter (default: True) to SsmRunCommandOperator, SsmRunCommandCompletedSensor, and SsmRunCommandTrigger.
When set to False:
- Tasks complete successfully regardless of command exit codes
- Exit codes can be retrieved with SsmGetCommandInvocationOperator for routing decisions
- AWS-level failures (TimedOut, Cancelled) still raise exceptions
- Command-level failures (non-zero exit codes) are tolerated
The default value of True maintains existing behavior for backward compatibility.
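To make the intended semantics concrete, here is a minimal, illustrative sketch of the decision described above. It is not the provider's implementation. The helper names (should_raise, route_on_exit_code), the task ids, and the exit-code meanings are hypothetical, and the sketch assumes a non-zero exit code surfaces as the SSM invocation status "Failed".

```python
# Illustrative sketch only; names and mappings below are assumptions,
# not code from the Amazon provider.

# Statuses treated as AWS-level failures in the proposal.
AWS_LEVEL_FAILURES = {"TimedOut", "Cancelled"}


def should_raise(status: str, fail_on_nonzero_exit: bool = True) -> bool:
    """Return True if the task should fail for the given invocation status."""
    if status in AWS_LEVEL_FAILURES:
        # AWS-level failures always raise, whatever the flag says.
        return True
    if status == "Failed":
        # Assumed here: "Failed" means the command exited non-zero.
        return fail_on_nonzero_exit
    return False


def route_on_exit_code(exit_code: int) -> str:
    """Map an exit code to a downstream task id (hypothetical ids)."""
    routes = {
        0: "on_success",
        1: "on_warning",
        2: "on_partial_success",
    }
    # Any unmapped code falls through to a catch-all branch.
    return routes.get(exit_code, "on_unexpected")
```

A function like route_on_exit_code could serve as the callable of a branching task, fed with the exit code retrieved from the command invocation.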
My general comments:
* I think the idea is great; this is indeed a feature users might want. Thanks for creating a PR for it.
* Creating documentation is great, but I think the document is way too long. This is only my personal opinion, so I would wait to see what others think, but if we create documentation this big for a single new parameter, it will be impossible to maintain. Using AI to create code and/or documentation is great, but we should keep in mind that longer is NOT better. Again, I support the documentation, but this is way too big for me. As a user I probably won't read it all, and as a developer I am scared that we will need to maintain it.
* The system test is a great idea. Could you please move these 3 examples into the current system test?
Thanks for your feedback. I was on the fence myself on the extra docs, and I understand the concern. I'm happy to move all documented patterns to an external article.
Have you run the system test to ensure that it's working correctly?
Thanks for the approval! I’ve added a few additional tests to the system test to cover this change, following @vincbeck’s feedback, and I can confirm that everything executes successfully.