Add fail_on_nonzero_exit parameter to SSM operators for exit code routing

Open ksharlandjiev opened this issue 2 months ago • 2 comments

Problem

SSM operators currently fail when commands return non-zero exit codes, making it impossible to:

  • Route workflows based on different exit codes
  • Handle commands where non-zero exit codes represent valid business states (e.g., partial success, warnings)
  • Implement conditional retry logic based on specific exit codes
  • Migrate from traditional schedulers like Autosys that support exit code routing

Users have been forced to implement manual polling workarounds with custom Python tasks to handle these scenarios.
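
For context, a workaround of this kind might look roughly like the sketch below. It is not taken from this issue. It assumes a boto3 SSM client is passed in as `ssm_client` and reads the `Status` and `ResponseCode` fields of the `get_command_invocation` response; the function name `wait_for_exit_code` and the polling parameters are hypothetical.

```python
import time

# Statuses after which the invocation is assumed not to change any more.
TERMINAL_STATUSES = {"Success", "Failed", "TimedOut", "Cancelled"}


def wait_for_exit_code(ssm_client, command_id, instance_id,
                       poll_seconds=5.0, max_attempts=60):
    """Poll an SSM command invocation and return the command's exit code.

    `ssm_client` is assumed to be a boto3 SSM client, or any object with a
    compatible `get_command_invocation` method.
    """
    for _ in range(max_attempts):
        response = ssm_client.get_command_invocation(
            CommandId=command_id, InstanceId=instance_id
        )
        status = response["Status"]
        if status in TERMINAL_STATUSES:
            # AWS-level failures carry no meaningful exit code, so raise.
            if status in ("TimedOut", "Cancelled"):
                raise RuntimeError(
                    f"SSM command {command_id} ended with status {status}"
                )
            # "Success" or "Failed": hand the exit code back to the caller.
            return response["ResponseCode"]
        time.sleep(poll_seconds)
    raise TimeoutError(f"SSM command {command_id} did not finish in time")
```

Every DAG that needs exit code routing has to carry some variant of this loop, which is the duplication the proposal aims to remove.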

Proposal

Add a fail_on_nonzero_exit parameter (default: True) to SsmRunCommandOperator, SsmRunCommandCompletedSensor, and SsmRunCommandTrigger.

When set to False:

  • Tasks complete successfully regardless of command exit codes
  • Exit codes can be retrieved with SsmGetCommandInvocationOperator for routing decisions
  • AWS-level failures (TimedOut, Cancelled) still raise exceptions
  • Command-level failures (non-zero exit codes) are tolerated

The default value of True maintains existing behavior for backward compatibility.
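
The intended behavior can be summarized in a small sketch. This is an illustration of the rules listed above, not the actual provider implementation: the helper names `should_raise` and `choose_branch`, the exit code meanings, and the downstream task ids are all hypothetical, and it assumes a non-zero exit code surfaces as the SSM status `Failed`.

```python
# AWS-level failures always raise, whatever the parameter is set to.
AWS_LEVEL_FAILURES = {"TimedOut", "Cancelled"}


def should_raise(status, fail_on_nonzero_exit=True):
    """Decide whether a finished SSM invocation should fail the task."""
    if status in AWS_LEVEL_FAILURES:
        return True
    if status == "Failed":
        # Command-level failure: tolerated only when explicitly requested.
        return fail_on_nonzero_exit
    return False


def choose_branch(exit_code):
    """Map an exit code to a (hypothetical) downstream task id."""
    routes = {
        0: "on_success",
        1: "on_warning",
        2: "on_partial_success",
    }
    # Any exit code without an explicit route goes to the failure handler.
    return routes.get(exit_code, "on_failure")
```

With `fail_on_nonzero_exit=False`, a function like `choose_branch` could serve as the callable of a branching task, fed with the exit code retrieved through `SsmGetCommandInvocationOperator`.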

ksharlandjiev avatar Nov 03 '25 15:11 ksharlandjiev

My general comments:

* I think the idea is great; this is indeed a feature users might want. Thanks for creating a PR for it.

* Adding documentation is great, but I think the document is far too long. This is only my personal opinion, so I would wait to see what others think, but if we write this much documentation for a single new parameter, it will be impossible to maintain. Using AI to create code and/or documentation is fine, but we should keep in mind that longer is NOT better. Again, I support the documentation, but this is too big for me. As a user I probably won't read it all, and as a developer I am worried about having to maintain it.

* The system test is a great idea. Could you please move these 3 examples into the existing system test?

Thanks for your feedback. I was on the fence myself on the extra docs, and I understand the concern. I'm happy to move all documented patterns to an external article.

ksharlandjiev avatar Nov 03 '25 22:11 ksharlandjiev

Have you run the system test to ensure that it's working correctly?

Thanks for the approval! I’ve added a few additional tests to the system test to cover this change, following @vincbeck’s feedback, and I can confirm that everything executes successfully.

ksharlandjiev avatar Dec 10 '25 23:12 ksharlandjiev