Have `operator-sdk scorecard` return different exit codes when it fails a check than when it encounters a fatal error
Feature Request
Describe the problem you need a feature to resolve.
I'm writing some automation that calls out to operator-sdk scorecard, and have noticed that it will return the same exit code (1) if the tool either encountered a fatal error, or succeeded in execution but the bundle being tested did not pass a test.
Effectively, this means that I must parse stdout/stderr to determine if the tool was able to parse the bundle at all, or encountered a fatal error in the process. I'm assuming that having nothing in stderr implies the tool did not encounter an error in execution, so I can then parse stdout which should contain the results.
As an example, I'm running something like this
# operator-sdk scorecard --output json --selector=suite=olm my-bundle-image-here:latest
*** JSON OUTPUT OMITTED, but the tool completed and
the output reflected that my bundle failed three tests ***
# echo $?
1
If I ran that with a bogus bundle image, it would also return 1, but I would see something a Fatal log line in stderr.
Describe the solution you'd like.
The easiest thing conceptually is to return a different non-zero exit code if it's a test failure vs. if it's a fatal error. So if any test failed, return 2, vs. returning 1 if the tool encountered a fatal error preventing it from executing.
As a semantic, I do feel like the tool should return 0 even if a test fails (to tool itself worked!). But I also understand how that might make certain automation scenarios a bit trickier (now the developer has to parse the output instead of just relying on the exit code to tell them that something needs their attention). With that in mind, the above solution feels more beneficial.
Technically changing the return code for a failed test or a scorecard command error by default will be a breaking change, so there should be a flag to indicate there should be a different return value. As @jmrodri suggested in the most recent triage meeting, this flag should be named and perhaps accept parameters in a forward-compatible way, i.e. if there are more exit codes we want to add in the future.
I propose --test-exit-code[=2], which is obvious in name and can be modified to accept either an integer or some key=code value later for test- or scenario-specific codes. The default of 2 should be documented. Thoughts @theishshah @jmrodri @asmacdo?
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale
Currently, in preflight, we are having to look for "FATA" in stderr to determine whether operator-sdk scorecard returned via 'log.Fatal' or, in the case of just a failed test, os.Exit(1). I am curious how others that rely on the exit code currently make this determination? I ask this because of the "technically a breaking change" comment. It seems to me that these are separate events (an actual error vs. a failed test).
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale This is still an issue. While we have determined that the output differs based on whether there is a TTY or not, it still remains that an error vs a failed test are different outcomes. I would like an explanation why this would be considered a breaking change, other than "that's how it currently is".
/lifecycle frozen