
Add a healthcheck to the docker-compose file for crackq to detect when GPU devices have disappeared

Open hkelley opened this issue 6 months ago • 2 comments

In an attempt to address the NVIDIA GPU flakiness (the crackq container sometimes loses its GPU devices; see https://github.com/NVIDIA/nvidia-container-toolkit/issues/48), I'm experimenting with:

  1. Adding a healthcheck to the crackq service in docker-compose to detect when the GPUs go missing
    crackq:
        build:
            context: ./build
            dockerfile: Dockerfile
        image: "nvidia-ubuntu"
        ports:
            - "127.0.0.1:8080:8080"
        depends_on:
            - redis
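        healthcheck:
            test: hashcat -I | grep 'Backend Device'  # unhealthy when hashcat lists no backend devices
            interval: 5m       # time between checks
            retries: 1         # one failed check marks the container unhealthy
            start_period: 60s  # failures during startup are not counted
            timeout: 30s       # a check that runs longer than this counts as failed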
        networks:
            - crackq_net
  2. Once I'm confident the healthcheck is reliable, adding a service for https://hub.docker.com/r/willfarrell/autoheal/ to the docker-compose file. It should be able to restart the crackq container whenever the healthcheck marks it unhealthy; see https://stackoverflow.com/questions/47088261/restarting-an-unhealthy-docker-container-based-on-healthcheck
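
For reference, a rough sketch of what the autoheal step might look like. It is untested against crackq and is a fragment to merge into the existing compose file, not a complete one. The `autoheal` service name, the `autoheal=true` label on crackq, and the 60-second poll interval are my assumptions. The environment variables and the Docker socket mount are the ones described in the willfarrell/autoheal README.

```yaml
# Sketch only: merge into the existing docker-compose file.
services:
    crackq:
        # ... existing crackq settings, including the healthcheck ...
        labels:
            - "autoheal=true"    # assumed label: opts this container in to autoheal

    autoheal:                    # assumed service name
        image: willfarrell/autoheal
        restart: always
        environment:
            - AUTOHEAL_CONTAINER_LABEL=autoheal   # watch containers labelled autoheal=true
            - AUTOHEAL_INTERVAL=60                # seconds between polls (assumed value)
        volumes:
            # autoheal talks to the Docker daemon through its socket
            - /var/run/docker.sock:/var/run/docker.sock
```

With `interval: 5m` and `retries: 1` on the healthcheck, a lost GPU could go unnoticed for up to about five minutes before autoheal sees the unhealthy status, so the healthcheck interval matters more here than the autoheal poll interval.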

I will update this issue as I make progress.

hkelley, Aug 19 '24 12:08