rare-disease-wf

(WIP) best-practices workflow for rare disease

For rare disease, the best practices and the expected number of candidate variants for each inheritance mode are known. The filtering itself is easily done with a tool like slivar. That filtering is a necessary first step, but it has the following limitations:

  1. it leaves an analyst or clinician with choices on how to prioritize the 10-15 candidate variants, or ~100 for autosomal (non de novo) dominant.
    • This is quite a small number, but the prioritization after this is highly variable across tools and analysts.
  2. it is limited to text/spreadsheet output
  3. it assumes a high-quality, jointly-called VCF is already available
  4. it leaves the analyst with the chore of getting IGV set up, and browsing each candidate for each family.

Quickstart

Note, it is early days for the project. It will produce high-quality SNP/indel candidates but you may need experience with nextflow to run it easily.

This project currently has a workflow that can be run as:

# NOTE that you need to remove everything after \ on each line for the command to work
# the comments here are just for documentation purposes.
nextflow run -resume -profile slurm rare-disease.nf \
    -config nextflow.config \    # a starting config is included in this repo. adjust from there.
    --xams "/path/to/*/*.cram" \ # NOTE that this is a string glob
    --ped $pedigree_file \       # see: https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format
    --fasta $reference_fasta \
    --gff $gff \                   # e.g. from: ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/
    --slivarzip gnomad.hg38.zip  \  # from: https://github.com/brentp/slivar#gnotation-files
    --cohort_name my_rare_disease
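
Because the inline comments above have to be stripped by hand before the command will run, one option is to assemble the same command in a small script where comments are harmless. The following is a minimal Python sketch and is not part of this repository: the function name `build_command`, its parameter names, and the placeholder file names are hypothetical, and only the flags shown in the command above are used.

```python
import shlex


def build_command(xams, ped, fasta, gff, slivarzip, cohort_name,
                  profile="slurm", config="nextflow.config"):
    """Return the quickstart nextflow command as an argument list."""
    return [
        "nextflow", "run", "-resume", "-profile", profile, "rare-disease.nf",
        "-config", config,            # a starting config is included in this repo
        "--xams", xams,               # a glob passed as a string, not expanded by the shell
        "--ped", ped,                 # pedigree file in PED format
        "--fasta", fasta,             # reference fasta
        "--gff", gff,                 # gene annotation, e.g. from Ensembl
        "--slivarzip", slivarzip,     # slivar gnotation file
        "--cohort_name", cohort_name,
    ]


cmd = build_command(
    xams="/path/to/*/*.cram",
    ped="family.ped",                 # placeholder file names, adjust to your data
    fasta="reference.fa",
    gff="genes.gff3",
    slivarzip="gnomad.hg38.zip",
    cohort_name="my_rare_disease",
)
# shlex.join quotes the glob so that a shell passes it through unexpanded.
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run(cmd, check=True)` directly.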

Output

See this wiki page for more information about how to use the output.

This does:

  1. Run DeepVariant and GLNexus (we have shown these tools to give higher quality results for trios) in an efficient nextflow workflow that can be easily run in the cloud or on a cluster.
  2. Decompose and normalize variants.
  3. Annotate with bcftools csq and snpEff
  4. Annotate with allele frequency and inheritance modes using slivar
  5. Annotate with gene-based annotations:
    • clinvar-gene-phenotype
    • loss-of-function intolerance
  6. Output high-quality calls from slivar for recessive, dominant, x-linked, compound-het and other inheritance modes.
  7. Generate and link pre-made, standalone igv.js/jigv outputs for each candidate.
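
To make the inheritance-mode annotation in steps 4 and 6 more concrete, here is a toy Python sketch of how two of the modes can be recognized from the genotypes of a trio. This is not slivar's implementation: the function name `trio_mode` is hypothetical, genotypes are assumed to be encoded as alternate-allele counts, and the quality, depth and allele-frequency filters that a real analysis depends on are left out.

```python
def trio_mode(kid, dad, mom):
    """Classify one autosomal site from alternate-allele counts (0, 1 or 2).

    Returns "de_novo", "recessive", or None when neither pattern matches.
    """
    if kid == 1 and dad == 0 and mom == 0:
        return "de_novo"      # heterozygous child, both parents homozygous reference
    if kid == 2 and dad == 1 and mom == 1:
        return "recessive"    # homozygous-alternate child, both parents heterozygous
    return None


print(trio_mode(1, 0, 0))
print(trio_mode(2, 1, 1))
```

X-linked and compound-het candidates need sex and phasing or gene-level information, so they do not fit a per-site check like this one.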

The key output will be in: results-rare-disease/${cohort_name}.slivar.candidates.tsv which can be easily viewed in Excel or other spreadsheet software. In addition, it will create: results-rare-disease/${cohort_name}.jigv.html and results-rare-disease/jigv_plots/* which together provide an HTML table and interactive igv.js views of each variant and associated alignments that do not rely on the original alignment files.
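
As a quick sanity check on the candidates file, it can help to count rows per family and inheritance mode and compare against the expected handful of candidates. The sketch below is a hedged example in Python: the function name `count_candidates` is hypothetical, and the column names `mode` and `family_id` are assumptions, so check the header line of your own file and pass different names if needed.

```python
import csv
from collections import Counter


def count_candidates(tsv_path, mode_col="mode", family_col="family_id"):
    """Count candidate rows per (family, inheritance mode).

    The default column names are assumptions. A leading '#' on the
    header line is ignored when matching column names.
    """
    counts = Counter()
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = [name.lstrip("#") for name in next(reader)]
        mode_idx = header.index(mode_col)
        family_idx = header.index(family_col)
        for row in reader:
            if row:  # skip blank lines
                counts[(row[family_idx], row[mode_idx])] += 1
    return counts
```

Calling `count_candidates("results-rare-disease/my_rare_disease.slivar.candidates.tsv")` returns a `Counter` keyed by `(family, mode)` pairs. A family with far more rows than expected for a mode is worth a closer look before manual review.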

In coming releases, this will:

  1. Output QC with somalier and other tools to be shown in multiQC
  2. Output high-quality SVs (using manta -> graphtyper)

Octopus

Currently, octopus is included as a separate workflow. The octopus.nf pipeline detects trios and families and runs them together, then iteratively merges across families using the n+1 schema described in the octopus docs. Finally, the workflow does the forest filtering recommended by the octopus documentation. We plan to integrate the octopus and DeepVariant calls in the future.
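
Trio detection of the kind described above can be illustrated from the pedigree file alone. The following Python sketch is not the logic used in octopus.nf: the function name `find_trios` is hypothetical, and it assumes a whitespace-delimited PED file whose first four columns are family, sample, father and mother, with `0` marking a missing parent.

```python
def find_trios(ped_path):
    """Return (family, child, father, mother) for every child whose
    parents are both present as samples in the same family."""
    rows = []
    with open(ped_path) as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and header/comment lines
            # PED columns: family, sample, father, mother, sex, phenotype
            rows.append(tuple(line.split()[:4]))
    samples = {(fam, sample) for fam, sample, _, _ in rows}
    return [
        (fam, sample, dad, mom)
        for fam, sample, dad, mom in rows
        if (fam, dad) in samples and (fam, mom) in samples
    ]
```

Samples that are not returned as the child of a trio, and are not a parent in one, would need to be handled separately, for example as singletons.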

Future Development

Development and research are underway so that this workflow will:

  1. Add a high-quality set of SV/CNVs
  2. Add some prioritization of variants
    • For example, lower priority to variants filtered in gnomAD
  3. Integrate SV/CNV calls with the SNP/indel calls to find, for example, compound heterozygotes with a SNP:SV pair.
  4. Evaluate use of octopus to find large indels (and/or SNPs and indels).
  5. Use GTEx + phenotypes to further prioritize variants in a family- and phenotype-specific way so that, for example, variants in genes that are not expressed in relevant tissues are down-weighted.
  6. Provide a graphical user interface so that sorting, filtering, note-taking and sharing are simplified.

Software Used

  • DeepVariant Variant Calling with Deep Learning. https://doi.org/10.1038/nbt.4235
  • GLNexus Joint variant calling. http://dx.doi.org/10.1101/343970
  • octopus haplotype-based mutation caller. https://doi.org/10.1038/s41587-021-00861-3
  • bcftools BCF/VCF manipulation. https://doi.org/10.1093/gigascience/giab008
  • bcftools csq variant consequence annotation. https://doi.org/10.1093/bioinformatics/btx100
  • htslib C library for genomics data. https://doi.org/10.1093/gigascience/giab007
  • slivar variant filtering and annotation. https://doi.org/10.1101/2020.08.13.249532
  • igv.js JavaScript genomics viewer. https://doi.org/10.1101/2020.05.03.075499
  • nextflow scientific workflows. https://doi.org/10.1038/nbt.3820
  • manta structural variant caller. https://doi.org/10.1093/bioinformatics/btv710
  • dysgu structural variant caller. https://doi.org/10.1101/2021.05.28.446147
  • paragraph structural variant genotyper. https://doi.org/10.1186/s13059-019-1909-7
  • jasmine structural variant merging. https://doi.org/10.1101/2021.05.27.445886
  • duphold structural variant depth annotation. https://doi.org/10.1093/gigascience/giz040
  • snpEff variant consequence annotation. https://doi.org/10.4161/fly.19695
  • svpack structural variant annotation.