Test that the prod deploy alert will alert if the backend and frontend can't talk to each other

Open fzhao99 opened this issue 1 year ago • 4 comments

In the recent postmortem, it came up that we weren't sure whether the recently added health check probe would "catch itself" even though it should. This ticket captures the work to prove to folks that it would.

See #6960 and this PR for more context

Acceptance criteria

  • A video / series of screenshots that show that if the backend is down / Okta is down / the Feature Flag db call throws, the frontend status page at https://www.simplereport.gov/app/health/deploy-smoke-test returns false.
    • We have verified that the script that spins up a Selenium viewer / sends out a Slack alert works when the script fails, but it's worth proving that the earlier link in the chain (the status page reporting failure) works as well.
  • Either in this ticket or in a follow-up ticket, figure out how to page the on-call person if this alert fires (first we want to test that it's working and, hopefully, reduce its noisiness)

To do the above, you'll probably need to

  1. Edit / duplicate a version of the prod deploy smoke test action to watch for a lower env
  2. Break the backend for that lower env so that the frontend displays failure
  3. Screenshot / record that process to verify the alert triggers as expected
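
Step 1 mostly comes down to making the target host configurable. A minimal sketch, assuming a hypothetical `smoke_test_url` helper (not taken from the actual workflow); the path and both host names are the ones mentioned in this issue:

```python
from urllib.parse import urlunsplit

# Path of the frontend status page, as given in the acceptance criteria.
SMOKE_TEST_PATH = "/app/health/deploy-smoke-test"


def smoke_test_url(host: str) -> str:
    """Build the status page URL for a given environment host (hypothetical helper)."""
    return urlunsplit(("https", host, SMOKE_TEST_PATH, "", ""))


if __name__ == "__main__":
    # Prod is what the existing alert watches; dev5 is a lower env that can be broken safely.
    for host in ("www.simplereport.gov", "dev5.simplereport.gov"):
        print(smoke_test_url(host))
```

With the host passed in as a parameter, a duplicated workflow only needs a different input value rather than a separate script.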

We also want to tweak the alert because it is somewhat noisy: it seems to fail once every two dozen runs or so. On failure, it might be worth taking a screenshot / logging extra info for further investigation.
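
One possible way to cut down on one-off failures is to retry the check a few times and log each failed attempt before alerting. This is only a sketch under assumptions: `check_with_retries` and its parameters are hypothetical, and `check` stands in for whatever the real smoke test does to read the status page.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger("smoke_test")


def check_with_retries(
    check: Callable[[], bool],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> bool:
    """Return True as soon as one attempt passes; False only if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            if check():
                return True
            # Record each failed attempt so intermittent failures can be investigated later.
            logger.warning("attempt %d/%d reported failure", attempt, attempts)
        except Exception:
            # A check that raises counts as a failure; the traceback is logged.
            logger.exception("attempt %d/%d raised an exception", attempt, attempts)
        if attempt < attempts:
            time.sleep(delay_seconds)
    return False
```

The trade-off is that a real outage is reported a little later (up to `attempts` checks), in exchange for fewer alerts from a single flaky run.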

fzhao99 Jan 12 '24 21:01

Got a false positive run here: https://github.com/CDCgov/prime-simplereport/actions/runs/7588051271/job/20669739705#step:14:26

fzhao99 Jan 19 '24 19:01

Backend/Okta/Feature flags.

It might be great to get a JSON blob on that page with values that tell us which of the services are up/down
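
A rough sketch of what such a blob could look like. The shape (`success` plus a `services` map) and the `build_health_report` helper are assumptions for illustration, not the current endpoint's response format:

```python
import json
from typing import Callable, Dict


def build_health_report(checks: Dict[str, Callable[[], bool]]) -> str:
    """Run each named check and return a JSON string with per-service status."""
    services = {}
    for name, check in checks.items():
        try:
            services[name] = "up" if check() else "down"
        except Exception:
            # A check that throws (e.g. a failing feature flag db call) is reported as down.
            services[name] = "down"
    # The overall flag is true only when every service is up.
    overall = all(status == "up" for status in services.values())
    return json.dumps({"success": overall, "services": services}, sort_keys=True)


if __name__ == "__main__":
    # Stand-in checks; real ones would call the backend, Okta, and the feature flag db.
    print(build_health_report({
        "backend": lambda: True,
        "okta": lambda: True,
        "featureFlags": lambda: False,
    }))
```

Keeping a single top-level boolean would let the existing smoke test keep a simple pass/fail check, while the per-service map shows which dependency caused a failure.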

alismx Feb 27 '24 20:02

Update the alert to run against a lower env so we can turn a backend off in the lowers and see it trigger.

DanielSass Apr 23 '24 17:04

Smoke test endpoint intentionally broken and deployed to dev5 for testing (image)

Smoke test deploy dev workflow configured to run on this testing branch after a completed deploy dev run (image)

dev5.simplereport.gov/app/health/deploy-smoke-test shows a failure status (Screenshot 2024-05-13 132539)

Slack alert successfully triggered (image)

mpbrown May 13 '24 17:05