ol-infrastructure

Write a high level wrapper for TestInfra to provide higher level system integration checks for our applications

Open feoh opened this issue 1 year ago • 5 comments

Description/Context

Recently we had a production outage in MIT XPro where AMIs were built without their static assets. Our current health check was unable to detect this.

We should write a higher level wrapper for testinfra that will enable us to easily create and maintain higher level integration tests that answer questions like:

  • Are all the necessary assets appearing when I hit a page?
  • Can the user log in properly?
  • @gumaerc suggested high-level integration tests to validate Concourse health, e.g. is the pipeline processing steps? (Other ideas/metrics for determining Concourse health should go here.)
  • Should run in a container. We can bind the Docker socket to allow host-level container visibility from inside another container.
  • Should expose a REST API so we can run health checks at runtime, not just at container start
    • REST API should return a JSON blob detailing which tests passed or failed.
    • REST API should also return the results of the OpenEdX Health check endpoint inline
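
As a rough illustration of the JSON blob described above, here is a minimal sketch of how named checks could be run and collected into one report. Everything in it is an assumption for illustration: the function names (`run_checks`, `check_static_assets`, `check_login`), the report keys, and the `edx_health` parameter are hypothetical, and the placeholder checks stand in for real testinfra or HTTP-based tests. It uses only the standard library; a REST handler could return the resulting dict as its response body.

```python
import json
from typing import Any, Callable, Dict, Optional


def check_static_assets() -> None:
    """Hypothetical placeholder: a real check would fetch a page and
    verify that its static assets load. Passing means no exception."""
    return None


def check_login() -> None:
    """Hypothetical placeholder that always fails, to show how a
    failure appears in the report."""
    raise AssertionError("login page returned unexpected status")


def run_checks(
    checks: Dict[str, Callable[[], None]],
    edx_health: Optional[Any] = None,
) -> Dict[str, Any]:
    """Run each check and build a JSON-serializable report.

    A check passes if it returns without raising. `edx_health` is an
    assumed hook for embedding an upstream health response inline.
    """
    results: Dict[str, Dict[str, Any]] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = {"passed": True, "detail": None}
        except Exception as exc:  # report any failure rather than crash
            results[name] = {"passed": False, "detail": str(exc)}
    return {
        "healthy": all(r["passed"] for r in results.values()),
        "checks": results,
        "edx_health": edx_health,
    }


if __name__ == "__main__":
    report = run_checks(
        {"static_assets": check_static_assets, "login": check_login}
    )
    print(json.dumps(report, indent=2))
```

Keeping each check as a plain callable means a failing check is recorded in the report instead of aborting the run, so one broken test does not hide the results of the others.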

Plan/Design

Design/tasking TBD.

feoh avatar Jan 02 '24 18:01 feoh

Pipeline for new ol-infra-health-checks docker image.

feoh avatar Jan 19 '24 22:01 feoh

A few things:

  • I was incorrectly installing an ancient version of docker
  • I was missing the docker compose plugin

Both issues are fixed.

Also, I needed a volume to expose the external host's /etc/docker/compose/docker-compose.yaml to the container so our health check container could introspect and connect to the containers under test (e.g. lms, cms).

But, I have all the infra in place, docker and docker compose properly installed, and my tests pass for reals this time! :)
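
For readers wondering what introspecting the containers under test might look like, here is a minimal sketch that asks `docker compose` for the state of each service defined in the mounted compose file. This is an assumption for illustration, not necessarily how the health check container does it (testinfra's own backends are another option). The helper names `parse_compose_ps` and `service_states` are hypothetical, and the `Service` and `State` keys are assumed to be present in the `docker compose ps --format json` output.

```python
import json
import subprocess
from typing import Dict

# Path taken from the comment above; it must be mounted into the container.
COMPOSE_FILE = "/etc/docker/compose/docker-compose.yaml"


def parse_compose_ps(output: str) -> Dict[str, str]:
    """Map service name to state from `docker compose ps --format json`.

    Accepts either a single JSON array or one JSON object per line,
    since the output shape has differed between Compose releases.
    """
    text = output.strip()
    if not text:
        return {}
    if text.startswith("["):
        entries = json.loads(text)
    else:
        entries = [json.loads(line) for line in text.splitlines() if line.strip()]
    return {entry["Service"]: entry["State"] for entry in entries}


def service_states(compose_file: str = COMPOSE_FILE) -> Dict[str, str]:
    """Run `docker compose ps` against the compose file and parse it.

    Requires the docker CLI, the compose plugin, and access to the
    Docker socket from wherever this runs.
    """
    result = subprocess.run(
        ["docker", "compose", "-f", compose_file, "ps", "--all", "--format", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compose_ps(result.stdout)
```

A check could then assert, for example, that `service_states().get("lms") == "running"` before running any deeper tests against that service.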

feoh avatar Jan 23 '24 22:01 feoh

Also - Exposing /etc/docker/compose to the inside of our containers could be a security risk. I need to understand what's at stake here and how to mitigate if necessary.

feoh avatar Jan 23 '24 22:01 feoh

FastAPI layer written and tested locally, but can't get very far. We need everything properly situated in its final setting for any of it to work.

Spoke with Tobias and, quite reasonably, this healthcheck needs to live behind a Caddy proxy like everything else.

Need to finish Caddy configuration and docker-compose networking as well as integration testing.

feoh avatar Jan 25 '24 22:01 feoh

I have the following in place and tested as working:

  • Docker volumes for healthcheck container
  • Caddy configuration
  • FastAPI wrapper

Something's still not right. I can docker compose exec inside a container and hit the healthcheck container's http port and get correct results, but when I try hitting the Caddy https port, I get ... Nothing.

Time to dig deeper and maybe get some help tomorrow.

feoh avatar Jan 31 '24 22:01 feoh

Closing as complete for the initial foray. We can open more detailed issues for any follow-on work.

blarghmatey avatar Aug 08 '24 13:08 blarghmatey