Make failed job acquisitions return a specific exit code (27)

Open moskyb opened this issue 1 year ago • 0 comments

Description

When the agent is running in acquire mode (that is, when a job UUID is specified for the agent to pick up, "inverting" the normal dispatch process), the job it's been instructed to pick up is sometimes unavailable. Perhaps it's already running on another agent, or maybe the job was cancelled but made its way to the agent anyway.

In these situations, the agent prints the error to stdout, and returns an exit code of 1, indicating failure.

However, acquisition can fail for both recoverable and non-recoverable reasons, and the agent doesn't differentiate between these in its status code. Users can grep through the logs for unrecoverable-sounding messages, but this is way more work than it needs to be.

This PR makes it so that if acquiring the job fails in such a way that the error is unrecoverable, the agent will return a status code of 27 (chosen arbitrarily, because it's a nice number). This allows consumers of the agent to not try to pick up that job again.

Context

  • COMP-332
  • https://docs.google.com/document/d/1qjIXw2gm88iiDggQAKVMQx0qwKntPjtlA4pW84om3xw/edit
  • Slack convo w/ Namespace

Testing

  • [x] Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • [x] Code is formatted (with go fmt ./...)

moskyb, May 02 '24 06:05