gatk-sv icon indicating copy to clipboard operation
gatk-sv copied to clipboard

Various docker build inconsistencies

Open MattWellie opened this issue 1 year ago • 3 comments

Feature request

Slightly relevant to #513

Module(s) or script(s) involved

scripts/docker/build_docker.py inputs/values/dockers.json

Description

Apologies if this is a somewhat messy issue, I'll aim to revisit and polish.

We are currently trying to adopt GATK-SV into our own workflows, and to absorb all cost for this we want to build local versions of each relevant Docker image from scratch.

I've noticed a few idiosyncrasies in the build_docker.py script (some of which may be caused by my incorrect parsing of the process. If there is relevant documentation to mitigate these points please could you direct me to it!):

  1. It is not clear from the dockers.json file which images are built by this repo, and which images we would need to copy in from a public host
  2. the names of the docker files do not correspond to the image keys in the dockers.json dictionary. This makes it challenging to generate a usable dockers.json without bootstrapping from your existing files:
  • the dependencies matrix sets up the prerequisite builds for each image, as well as assigning each a key (e.g. samtools-cloud)
  • When writing the newly created images into the dockers.json file, the key to use is determined by the method get_target_from_image. This method determines each image's key based on the current docker image path, rather than the key in the dependencies matrix, or the named directory where each Dockerfile is found, e.g.
# dockers.json
{
    "samtools_cloud_docker": "artifactory/samtools-cloud:tag"
}

# dependency matrix entry
dependencies = {
        "samtools-cloud": ImageDependencies(
            git_dependencies="dockerfiles/samtools-cloud/*",
            docker_dependencies={
                "sv-base-mini": "MINIBASE_IMAGE",
                "samtools-cloud-virtual-env": "VIRTUAL_ENV_IMAGE"}
        )
}

# key definition
get_target_from_image("artifactory/samtools-cloud:tag") == 'samtools_cloud_docker'
  • dockerfiles/sv-pipeline/Dockerfile encodes a single image, but it is referred to in both the WDL files and dockers.json by all these various names.
"sv_pipeline_base_docker": "us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:2023-03-07-v0.26.10-beta-0c876f63", "sv_pipeline_docker": "us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:2023-03-07-v0.26.10-beta-0c876f63",
"sv_pipeline_hail_docker": "us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:2023-03-07-v0.26.10-beta-0c876f63",
"sv_pipeline_updates_docker": "us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:2023-03-07-v0.26.10-beta-0c876f63",
"sv_pipeline_qc_docker": "us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:2023-03-07-v0.26.10-beta-0c876f63",
"sv_pipeline_rdtest_docker": "us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:2023-03-07-v0.26.10-beta-0c876f63"

My issue summarised is:

If I use an empty dockers.json and set the --targets argument to all when building images:

  • Many images are not built at all (e.g. wham, gatk) as they are not defined in this repo. I haven't found documentation that explains in advance which images are built here and which are exceptions.
  • The dockers.json lacks many images by key; the duplication of a single image across multiple keys requires bootstrapping from the dockers.json file pointing to your own images, e.g. as above sv-pipeline is populated as a single key, but WDL files expect it referenced by all 6 various names.
  • The images which are populated in dockers.json are not accepted by the WDL files contained here, e.g. samtools-cloud is built, but samtools_cloud_docker is required (note both _docker suffix and potholing instead of hyphenation)

Proposal?

  • Enforce consistency between the image key in the dependency matrix, the name of the created docker image, and the label in the dockers.json file
  • Remove all many-to-one labels in the dockers.json, and ensure that throughout the WDL files each image is only referenced by a single key
  • Separate out external and internal docker images based on which are created by this repository and which are not. The image building process should populate one file fully, and that should be joined at runtime with all additional required images.

MattWellie avatar Mar 27 '23 06:03 MattWellie