Giraffe Mapper Evaluation and Application Scripts

This repository contains scripts used to reproduce our work with the new Giraffe short read mapper in vg.

Workflow Overview

The scripts expect to be run in roughly this order:

  • Giraffe mapping evaluation workflow
    • Preparation scripts to preprocess input files
    • Graph construction scripts, to make test graphs for Giraffe to map to
    • Indexing scripts to prepare the constructed graphs for mapping
    • Read simulation scripts to produce simulated reads with known graph positions, for evaluating mapping correctness
    • Mapping scripts, for assessing the speed and accuracy of Giraffe against competing mappers
    • Genotyping scripts, for assessing competing genotyping methods
    • Plotting scripts, for plotting the results of the mapping scripts
    • Dedicated allele balance plotting scripts, for producing plots of how length-changing variants affect read coverage at variable sites
  • Structural variant calling workflow
  • Code and data archiving

Finding Files Used

If you do not have access to UCSC's internal AWS systems, you will probably not be able to access many of the files the scripts use at their given paths. Public archived copies of the data should be available via UCSC and via Zenodo with preregistered DOI 10.5281/zenodo.4721495.

Replication Considerations

Note that the top-level workflows are not automated. Within each section, you will have to manually prepare the environment for each script and then run it. Some scripts expect to run locally with vg or snakemake installed and with sufficient memory and scratch space, some expect access to a Kubernetes cluster, and some expect to be launched on a Toil-managed autoscaling Mesos or Kubernetes cluster. We provide hints as to how to set up such environments, but a full tutorial is not given here. Additionally, scripts that launch asynchronous Kubernetes jobs do not include code to wait for the jobs to complete; you must provide that monitoring yourself.
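
As one way to supply that monitoring, the sketch below polls a Kubernetes Job until it either completes or fails. It assumes kubectl is installed and configured for your cluster. The function names, the 30-second polling interval, and the job name in the usage note are illustrative and are not part of this repository.

```python
import json
import subprocess
import time


def job_state(job):
    """Classify a parsed Kubernetes Job object as 'succeeded', 'failed', or 'running'."""
    conditions = job.get("status", {}).get("conditions") or []
    for cond in conditions:
        if cond.get("status") != "True":
            continue
        if cond.get("type") == "Complete":
            return "succeeded"
        if cond.get("type") == "Failed":
            return "failed"
    return "running"


def fetch_job(name, namespace=None):
    """Fetch a Job object as a dict by shelling out to kubectl."""
    cmd = ["kubectl", "get", "job", name, "-o", "json"]
    if namespace:
        cmd += ["--namespace", namespace]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(result.stdout)


def wait_for_job(name, namespace=None, poll_seconds=30, fetch=fetch_job):
    """Block until the named Job finishes; return 'succeeded' or 'failed'."""
    while True:
        state = job_state(fetch(name, namespace))
        if state != "running":
            return state
        time.sleep(poll_seconds)
```

For example, wait_for_job("my-mapping-job") (a hypothetical job name) would return once the job finishes. The fetch parameter exists only so that the polling logic can be exercised without a cluster.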

We provide scripts as close to what we actually ran as possible; as a result, these scripts will not be fully portable to your environment without modification. If you do not have access to UCSC's AWS storage buckets (such as s3://vg-k8s or s3://vg-data), or if you would like to avoid overwriting the original analysis artifacts, some scripts will have to be adapted to point at where you intend to keep the artifacts for your repetition of the analysis. Additionally, scripts designed to kick off Kubernetes jobs may need to be adapted to reference your Kubernetes environment's AWS credential secrets or namespace names.
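
One possible way to adapt the storage locations is to rewrite the bucket URLs in a script before running it. The sketch below is a minimal illustration, not something shipped with this repository: the function name, the destination bucket, and the example path are hypothetical, and it keeps the original bucket name as a path component so that objects from the two buckets do not collide.

```python
import re

# Bucket names mentioned above; extend this tuple if a script uses others.
ORIGINAL_BUCKETS = ("vg-k8s", "vg-data")

# Match s3://<bucket> only when the bucket name ends there, so that a
# different bucket such as s3://vg-data-other is left alone.
_BUCKET_URL = re.compile(
    r"s3://(" + "|".join(re.escape(b) for b in ORIGINAL_BUCKETS) + r")(?![\w.-])"
)


def retarget_buckets(script_text, new_bucket):
    """Rewrite s3://vg-k8s/... and s3://vg-data/... URLs to live under new_bucket."""
    return _BUCKET_URL.sub(
        lambda match: f"s3://{new_bucket}/{match.group(1)}", script_text
    )
```

With a hypothetical bucket named my-bucket, a line such as "aws s3 cp out.gam s3://vg-k8s/some/path/out.gam" would become "aws s3 cp out.gam s3://my-bucket/vg-k8s/some/path/out.gam". Review the rewritten script by hand before running it, since some paths may be assembled from variables that a text substitution cannot see.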

The scripts provided here access the Internet, and invoke other software that accesses the Internet, to download code and container images. While we include code and container image snapshots in our code and data archive, we have not done the engineering work in our software stack that would be required to use those snapshots as an alternative to the Internet locations where our scripts, and the software they invoke, expect to find things. Consequently, if, say, quay.io stops hosting the container images we used free of charge, the scripts are likely to stop working as written. Additionally, while we provide snapshots of the containers and software we produced for this work, we have not provided snapshots of other containers that our workflows use (such as, for example, aslethalfang/tabix:1.7). If these containers cease to be retrievable (for example, if they become old and Docker Hub deletes them due to inactivity, or if new authentication requirements become applicable for accessing them), then these scripts will stop working as written.
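
Before starting a long run, it may be worth checking that the container images a script depends on can still be retrieved. The sketch below is one way to do that, assuming the docker command-line client is installed; it uses "docker manifest inspect", which queries the registry without pulling the image. The function names are illustrative, and you would need to collect the list of image references from the scripts yourself.

```python
import subprocess


def image_is_retrievable(image, run=subprocess.run):
    """Return True if the registry still serves a manifest for this image reference."""
    result = run(
        ["docker", "manifest", "inspect", image],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def missing_images(images, check=image_is_retrievable):
    """Return the image references that could not be retrieved, in input order."""
    return [image for image in images if not check(image)]
```

For example, missing_images(["aslethalfang/tabix:1.7"]) returns an empty list while that image is still retrievable. A failure here does not distinguish between a deleted image, a network problem, and a new authentication requirement, so treat it as a prompt to investigate rather than as a definitive answer.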