ol-infrastructure
ol-infrastructure copied to clipboard
Write a high level wrapper for TestInfra to provide higher level system integration checks for our applications
Description/Context
Recently we had a production outage in MIT XPro where AMIs were built without their static assets. Our current health check was unable to detect this.
We should write a higher level wrapper for testinfra that will enable us to easily create and maintain higher level integration tests that answer questions like:
- Are all the necessary assets appearing when I hit a page?
- Can the user log in properly?
- @gumaerc suggested high level integration tests to validate Concourse health. Is the pipeline processing steps? (Other ideas/metrics for determining Concourse health should go here) etc.
- Should run in a container. We can bind the docker socker to allow host level container visibility from inside another container.
- Should expose a REST API so we can do health checks at runtime, not just container start
- REST API should return a JSON blog detailing tests that passed or failed.
- REST API should also return the results of the OpenEdX Health check endpoint inline
Plan/Design
Design/tasking TBD.
Pipeline for new ol-infra-health-checks docker image.
A few things:
- I was incorrectly installing docker with an ancient version
- I was missing the docker compose plugin Both issues are fixed.
Also, I needed a volume to expose the container to the external host's /etc/docker/compose/docker-compose.yaml so our health check container could introspect and connect to the containers under test (e.g. lms, cms)
But, I have all the infra in place, docker and docker compose properly installed, and my tests pass for reals this time! :)
Also - Exposing /etc/docker/compose to the inside of our containers could be a security risk. I need to understand what's at stake here and how to mitigate if necessary.
FastAPI layer written and tested locally, but can't get very far. We need everything properly situated in its final setting for any of it to work.
Spoke with Tobias and, quite reasonably, this healthcheck needs to live behind a Caddy proxy like everything else.
Need to finish Caddy configuration and docker-compose networking as well as integration testing.
I have the following in place and tested as working:
- Docker volumes for healthcheck container
- Caddy configuration
- FastAPI wrapper
Something's still not right. I can docker compose exec inside a container and hit the healthcheck container's http port and get correct results, but when I try hitting the Caddy https port, I get ... Nothing.
Time to dig deeper and maybe get some help tomorrow.
Closing as complete for the initial foray. We can open more detailed issues for any follow-on work.