newrelic-quickstarts
[Repository] Improve observability of workflows
Summary
The workflows that run on this repository have become much more important for the continued functionality of the quickstarts ecosystem. To support that, we need to be able to track how often our workflows succeed, be notified when they fail, and have the information available to debug issues when they arise.
What we want to know
- When validation/submission workflows run
- Whether they succeed or fail
- In the case of failure
  - Relevant information is reported to facilitate debugging
Ideas
- Error and info level logging
  - Forward logs to NR1?
  - Logging library vs custom implementation
- Reporting workflow failure when it fails before the validation/submission step
  - Currently only reports failure if we get to that step and it fails
- Workflow runs as Transactions?
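As a rough illustration of the "logging library" option above, here is a minimal sketch using Python's standard `logging` module to write info and error level records to stdout as one JSON object per line, which is a shape a log forwarder can usually parse. The logger name `quickstart-workflow` and the `context` attribute are hypothetical names chosen for this example, not anything the repository defines.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # `context` is a hypothetical attribute supplied via `extra=`.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str = "quickstart-workflow", stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout (or to `stream`)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(sys.stdout if stream is None else stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("validation started", extra={"context": {"quickstart": "aws/aws-ec2"}})
    log.error("validation failed", extra={"context": {"code": 500}})
```

A custom implementation would avoid the dependency on logger configuration, but the library gives level filtering for free.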
Information to capture
- Workflow and job ids
- Time/date
- The quickstart or install plan that was being validated/submitted
- The associated PR
- Number of quickstarts or install plans being modified.
- For failure:
  - The contents of the graphql query
  - Associated error messages and codes
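The list above could be gathered into one record per run. The following is only a sketch: the `WorkflowRunEvent` class, its field names, and `build_event` are hypothetical. It assumes the script runs inside GitHub Actions, where `GITHUB_WORKFLOW`, `GITHUB_RUN_ID`, `GITHUB_JOB`, and `GITHUB_REF` are set by the runner, and where `GITHUB_REF` has the form `refs/pull/<number>/merge` for pull request events.

```python
import os
import re
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class WorkflowRunEvent:
    """Hypothetical record of one validation/submission run."""
    workflow: str
    run_id: str
    job: str
    timestamp: str
    pull_request: Optional[int]
    target: str              # quickstart or install plan being processed
    items_modified: int      # number of quickstarts/install plans modified
    succeeded: bool
    graphql_query: Optional[str] = None            # only set on failure
    error_messages: List[str] = field(default_factory=list)


def pr_number_from_ref(ref: str) -> Optional[int]:
    """Extract the PR number from a ref like refs/pull/42/merge."""
    match = re.fullmatch(r"refs/pull/(\d+)/merge", ref)
    return int(match.group(1)) if match else None


def build_event(target, items_modified, succeeded, env=None,
                graphql_query=None, error_messages=None) -> dict:
    """Build a flat dict from the runner environment and the run outcome."""
    env = os.environ if env is None else env
    event = WorkflowRunEvent(
        workflow=env.get("GITHUB_WORKFLOW", ""),
        run_id=env.get("GITHUB_RUN_ID", ""),
        job=env.get("GITHUB_JOB", ""),
        timestamp=datetime.now(timezone.utc).isoformat(),
        pull_request=pr_number_from_ref(env.get("GITHUB_REF", "")),
        target=target,
        items_modified=items_modified,
        succeeded=succeeded,
        graphql_query=graphql_query,
        error_messages=list(error_messages or []),
    )
    return asdict(event)
```

Passing `env` explicitly keeps the function testable outside of a workflow run.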
Possible Solutions
Acceptance Criteria
- add extra logging to `stdout` and console log it.
- remove the use of the APM agent for the repo, and use custom events & logs instead.
- add an always running step at end of workflow to send status back to new relic.
- validate this data is reporting into the NEW DevEn account
- update dashboards that are keying off the APM events: Quickstart repo workflows
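For the "custom events" and "always running step" criteria, a final step guarded by `if: always()` could call a small script like the sketch below. It assumes the New Relic Event API endpoint `https://insights-collector.newrelic.com/v1/accounts/<account id>/events` with an `Api-Key` header, which should be checked against the current Event API documentation. The event type `QuickstartWorkflowRun` and the function names are hypothetical. Attribute values are assumed to be flat strings or numbers.

```python
import json
import urllib.request

# Assumed Event API endpoint; verify against the New Relic docs before use.
EVENT_API_URL = "https://insights-collector.newrelic.com/v1/accounts/{account_id}/events"


def build_status_request(account_id, api_key, status, attributes=None):
    """Build (but do not send) a POST carrying one custom event."""
    event = {
        "eventType": "QuickstartWorkflowRun",  # hypothetical event type
        "status": status,                      # e.g. "success" or "failure"
    }
    event.update(attributes or {})
    return urllib.request.Request(
        EVENT_API_URL.format(account_id=account_id),
        data=json.dumps([event]).encode("utf-8"),
        headers={"Content-Type": "application/json", "Api-Key": api_key},
        method="POST",
    )


def send_status(request, timeout=10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Splitting request construction from sending means the payload can be checked in a unit test without network access, and a failure to report does not need to mask the workflow's own exit status.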
@aswanson-nr 👋
we discussed this and decided to break it down into smaller steps, a first pass would be:
- add logging to `stdout`
- add an always running step at end of workflow to send status back to new relic.
- validate this data is reporting into the NEW DevEn account
this is just a general thought after some light thinking about extra output we might want to log in the future:
anything that would let us write/improve tests. as an example, we have functions validating graphql responses, etc. if we log what those responses are, we can then use that data in unit tests to catch errors in the future.
example failure: https://github.com/newrelic/newrelic-quickstarts/runs/5186615982?check_suite_focus=true.
in that instance, the information we have in the output isn't helpful. it would be helpful to include the whole response body from graphql, as well as the whole request body so that we can easily reproduce the failure -- and maybe write some tests using that info.
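one way to do this, as a sketch only: dump the full request and response bodies as a single JSON document in the workflow output, so the same text can be pasted into a test fixture later. the function names below are hypothetical, and the sketch assumes both bodies are already JSON-serializable dicts and that no secrets live in the bodies themselves (headers are deliberately left out).

```python
import json


def format_graphql_failure(request_body: dict, response_body: dict) -> str:
    """Serialize the full request/response pair for the workflow log."""
    return json.dumps(
        {"request": request_body, "response": response_body},
        indent=2,
        sort_keys=True,  # stable output makes diffs between failures readable
    )


def parse_graphql_failure(logged: str) -> tuple:
    """Recover the pair from logged text, e.g. to seed a unit test."""
    data = json.loads(logged)
    return data["request"], data["response"]


if __name__ == "__main__":
    # Illustrative values only, not taken from a real run.
    request = {"query": "mutation { example }", "variables": {"dryRun": True}}
    response = {"errors": [{"message": "example error"}]}
    print(format_graphql_failure(request, response))
```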
another thought:
for this workflow failure -- https://github.com/newrelic/newrelic-quickstarts/runs/5779399220?check_suite_focus=true -- we fail on the install plan step and don't get to the quickstart step.
if possible, it would be useful to know what install plans and quickstarts failed to submit / are blocked by this issue until it's fixed -- if this were to succeed, what quickstarts and install plans would be updated. this would help us prioritize fixing the failure more appropriately.
Something like this would help us triage and prioritize the failure:
[!] The following (1) install plans are impacted by this:
* foobar
[!] The following (3) quickstarts are impacted by this:
* aws/aws-ec2
* aws/aws-dynamo-db
* apache
We could determine the install plans and quickstarts impacted by this, but it would require a lot of manual work each time.
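A sketch of how that report might be automated from the list of files changed in a PR. It assumes each quickstart is identified by a `config.yml` under a `quickstarts/` directory, which is an assumption about the repository layout, and `quickstart_names` and `format_impact_report` are hypothetical helper names.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def quickstart_names(changed_files: Iterable[str], root: str = "quickstarts") -> List[str]:
    """Map changed config files to quickstart names (assumed layout)."""
    names = set()
    for raw in changed_files:
        path = PurePosixPath(raw)
        # Assumption: <root>/<name...>/config.yml identifies a quickstart.
        if path.name == "config.yml" and path.parts[:1] == (root,):
            names.add(str(path.parent.relative_to(root)))
    return sorted(names)


def format_impact_report(install_plans: List[str], quickstarts: List[str]) -> str:
    """Render the impacted items in the format shown above."""
    lines = []
    for label, items in (("install plans", install_plans), ("quickstarts", quickstarts)):
        if not items:
            continue  # skip empty sections entirely
        lines.append(f"[!] The following ({len(items)}) {label} are impacted by this:")
        lines.extend(f"* {item}" for item in items)
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_impact_report(["foobar"], ["aws/aws-ec2", "aws/aws-dynamo-db", "apache"]))
```

Printing this from the failing step would put the blocked items directly in the workflow output.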
Old issues will be closed after 105 days of inactivity. This issue has been quiet for 90 days and is being marked as stale. Reply here to keep this issue open.
This issue is being closed due to inactivity. Is this a mistake? Please re-open this issue or create a new one.