
Predicting the clinical impact of human mutation with deep neural networks


Millions of human genomes and exomes have been sequenced, but their clinical applications remain limited due to the difficulty of distinguishing disease-causing mutations from benign genetic variation. Here we demonstrate that common missense variants in other primate species are largely clinically benign in human, enabling pathogenic mutations to be systematically identified by the process of elimination. Using hundreds of thousands of common variants from population sequencing of six non-human primate species, we train a deep neural network that identifies pathogenic mutations in rare disease patients with 88% accuracy and enables the discovery of 14 new candidate genes in intellectual disability at genome-wide significance. Cataloging common variation from additional primate species would improve interpretation for millions of variants of uncertain significance, further advancing the clinical utility of human genome sequencing.

https://doi.org/10.1038/s41588-018-0167-z

evancofer avatar Jul 26 '18 15:07 evancofer

Goal

Predict whether a missense variant in an amino acid sequence is benign or pathogenic, while greatly augmenting the available training data by incorporating benign common variants from primates.

Computational aspects

  • The PrimateAI network is a deep residual convolutional neural network that predicts the pathogenic/benign status of nonsynonymous mutations in human protein sequences.
    • Input is a sequence of 51 amino acids, with the following information included at each position:
      • reference human amino acid
      • mutant human amino acid
      • position weight matrix from 99 vertebrates (11 primates, 50 non-primate mammals, 38 other)
    • The PrimateAI network was first trained in a semi-supervised manner to distinguish benign variants from a balanced set of randomly generated unlabeled variants.
      • the labeled benign variants were common variants (>=0.1% allele frequency) from ExAC/gnomAD with mean coverage >=1 (a total of 83,546 human variants), and common variants (>0.1% allele frequency) from six species of primates (total of 301,690 primate variants).
      • the unlabeled variants were filtered to remove any variants that were found in ExAC/gnomAD, or at loci with mean coverage <1 in ExAC/gnomAD, or at positions that could not be aligned with primate genomes. They were then sampled to match the trinucleotide context of the labeled variants.
    • Two held-out sets of 20,000 variants each (50% benign primate variants, 50% unlabeled variants) were used for validation and testing, respectively.
    • The position weight matrix is also fed into two additional pre-trained networks, whose outputs are fed into PrimateAI.
      • For each position in the sequence of 51 amino acids, these two networks:
        • predict protein secondary structure (helix, beta sheet, or coil)
        • predict three-state solvent accessibility (buried, intermediate, or exposed)
      • The two "auxiliary" networks are pre-trained on crystal structures from the Protein Data Bank, with 6,367 examples for training, 400 for validation, and 500 for testing.
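The 51-residue input described above can be sketched as a simple tensor encoding. This is a hypothetical illustration (function and variable names are mine, and the paper's actual feature layout differs in detail, e.g. it uses separate position weight matrices for primates, mammals, and other vertebrates):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_variant(window, center_alt, pwm):
    """Encode a 51-aa window centered on the variant as a (51, 60) array:
    one-hot reference sequence, one-hot mutant sequence (differing only
    at the center position), and a conservation position weight matrix
    derived from the multiple-sequence alignment."""
    assert len(window) == 51 and pwm.shape == (51, 20)
    ref = np.zeros((51, 20))
    alt = np.zeros((51, 20))
    for i, aa in enumerate(window):
        ref[i, AA_INDEX[aa]] = 1.0
        alt[i, AA_INDEX[aa]] = 1.0
    # Overwrite the center position with the mutant amino acid.
    alt[25] = 0.0
    alt[25, AA_INDEX[center_alt]] = 1.0
    return np.concatenate([ref, alt, pwm], axis=1)
```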
  • Analyzed data from the Exome Aggregation Consortium (ExAC) and the Genome Aggregation Database (gnomAD) to make predictions for coding variants in 123,136 humans.
  • The analysis of ClinVar variants was limited to only 1,146 variants (177 benign and 969 pathogenic), but this was done out of necessity and fairness: only relatively recent variants were used, to avoid inflating the performance of algorithms trained on earlier ClinVar releases before the cutoff date.
  • Identified 14 new candidate genes in intellectual disability by analyzing missense de novo variants in the Deciphering Developmental Disorders cohort.
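The trinucleotide-context matching used to build the unlabeled set can be sketched as a simple stratified sampler. A minimal sketch, assuming each variant is a dict carrying a `context` key (all names are illustrative, not the authors' code):

```python
import random
from collections import Counter, defaultdict

def sample_matched_unlabeled(labeled, unlabeled_pool, seed=0):
    """Sample unlabeled variants so their trinucleotide-context
    distribution matches that of the labeled (benign) set."""
    rng = random.Random(seed)
    # Bucket the unlabeled pool by trinucleotide context.
    by_context = defaultdict(list)
    for v in unlabeled_pool:
        by_context[v["context"]].append(v)
    # For each context, draw as many unlabeled variants as there
    # are labeled variants with that context (capped by availability).
    matched = []
    for context, n in Counter(v["context"] for v in labeled).items():
        pool = by_context[context]
        matched.extend(rng.sample(pool, min(n, len(pool))))
    return matched
```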

Comments

  • This paper highlights how useful it can be to augment training data with data from interesting/novel sources.
  • Although short indels would have complicated the position weight matrix required in the model inputs, it would have been nice if the authors had also analyzed their effects.
  • It would have been nice for the authors to compare performance against MutPred2 (which was released in 2017) instead of MutPred (which was released in 2009).
  • An analysis of (dis-)agreement between classifiers would be nice. I imagine that most clinical analyses of variants in rare diseases are not artificially limited to using a single classification method, and it would be a strong selling point if PrimateAI is correctly classifying variants that no other classifier can.
  • Their performance on the ClinVar database is a weak point, and they seem to attribute it to problems with ClinVar's manual annotation methods. I think filtering ClinVar variants by review status could have provided a better set of "gold standard" variants on which to evaluate their model.
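The review-status filtering suggested above is straightforward to apply to a ClinVar export. A sketch, assuming records expose a review-status string (the tier names follow ClinVar's star system, but the record/field layout here is hypothetical):

```python
# Review statuses corresponding to stronger curation in ClinVar's
# star system (2+ stars); statuses like "no assertion criteria
# provided" (0 stars) are excluded.
TRUSTED_STATUSES = {
    "criteria provided, multiple submitters, no conflicts",  # 2 stars
    "reviewed by expert panel",                              # 3 stars
    "practice guideline",                                    # 4 stars
}

def gold_standard(records):
    """Keep only variants whose review status suggests reliable
    manual curation, to use as an evaluation set."""
    return [r for r in records if r["review_status"] in TRUSTED_STATUSES]
```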

I have a few other points regarding the ClinVar evaluation (I think their misclassifications on ClinVar deserved a more rigorous analysis), though these are finer points to be sure:

  • I think that there is a lot of potential to use their model to explore and probe existing annotations. For instance, the authors mention that human variant annotators typically under-use features such as protein structure, but they do not attempt to determine whether this actually caused their poor performance on ClinVar. Do other models make mistakes on the same variants that they do?
  • Were the models that outperform them on ClinVar trained on data primarily derived from human-curated databases (which they posit could boost a model's performance on such databases by teaching it to recapitulate human biases)? This possibility is mentioned by the authors, but not confirmed.
  • Their two auxiliary networks are able to predict changes in structure and accessibility associated with variants they believe to be incorrectly annotated in ClinVar; if they masked the outputs of these sub-networks to simulate a structure-ignorant model, are the resulting predictions more in line with the ClinVar annotations? Conversely, does this cause a smaller change in predicted pathogenicity for the variants in ClinVar that they correctly classify? I realize this may be a little ambitious given the length of the manuscript and supplement.
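The (dis-)agreement analysis proposed above largely amounts to cross-tabulating per-variant correctness for pairs of classifiers. A minimal sketch with hypothetical prediction vectors (this is the 2x2 table that, e.g., McNemar's test operates on):

```python
from collections import Counter

def disagreement_table(preds_a, preds_b, labels):
    """Cross-tabulate two classifiers' per-variant correctness.
    Inputs are parallel lists of 0/1 calls and true 0/1 labels;
    the off-diagonal cells identify variants that only one of the
    two classifiers gets right."""
    table = Counter()
    for a, b, y in zip(preds_a, preds_b, labels):
        table[(a == y, b == y)] += 1
    return {
        "both_correct": table[(True, True)],
        "only_a_correct": table[(True, False)],
        "only_b_correct": table[(False, True)],
        "both_wrong": table[(False, False)],
    }
```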

evancofer avatar Aug 04 '18 19:08 evancofer