Compare results using masking of De Maio et al

Open jameshadfield opened this issue 4 years ago • 4 comments

De Maio et al provide a number of candidates for masking from phylogenetic reconstructions, grouped into a number of categories. We should explore what effect this has on our results and conclusions. Removal of (informative) SNPs will result in larger polytomies which (a) will largely be resolved by the coalescent via TreeTime and (b) will increase the time needed to run augur refine, neither of which is desirable.

Roughly, the proposed masking sites are:

start & end: 1–55 and 29804–29903

sites that appear to be highly homoplasic and have no phylogenetic signal and/or low prevalence: 187, 1059, 2094, 3037, 3130, 6990, 8022, 10323, 10741, 11074, 13408, 14786, 19684, 20148, 21137, 24034, 24378, 25563, 26144, 26461, 26681, 28077, 28826, 28854, 29700

homoplasic positions that are exclusive to a single sequencing lab or geographic location: 4050, 13402

positions that, despite having strong phylogenetic signal, are also strongly homoplasic: 11083, 15324, 21575

minimum sequence length: 29,400
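
To make the comparison concrete, here is a minimal sketch of how the positions listed above could be applied to an aligned sequence. It is not how augur or the ncov workflow implements masking; the function names (`sites_to_mask`, `mask_sequence`) are hypothetical, and it assumes sequences are already aligned to the 29,903 nt reference so that 1-based positions map directly to string indices.

```python
# Positions are 1-based and copied from the categories listed above.
HOMOPLASIC_NO_SIGNAL = [
    187, 1059, 2094, 3037, 3130, 6990, 8022, 10323, 10741, 11074,
    13408, 14786, 19684, 20148, 21137, 24034, 24378, 25563, 26144,
    26461, 26681, 28077, 28826, 28854, 29700,
]
LAB_OR_LOCATION_SPECIFIC = [4050, 13402]
HOMOPLASIC_WITH_SIGNAL = [11083, 15324, 21575]
GENOME_LENGTH = 29903  # assumed reference length (end of the 29804-29903 range)


def sites_to_mask(include_signal_sites=False):
    """Return the set of 1-based positions to mask.

    Sites with strong phylogenetic signal are excluded by default so the
    effect of masking them can be tested separately.
    """
    sites = set(range(1, 56)) | set(range(29804, GENOME_LENGTH + 1))
    sites.update(HOMOPLASIC_NO_SIGNAL, LAB_OR_LOCATION_SPECIFIC)
    if include_signal_sites:
        sites.update(HOMOPLASIC_WITH_SIGNAL)
    return sites


def mask_sequence(aligned_seq, sites, mask_char="N"):
    """Replace the given 1-based positions in an aligned sequence."""
    chars = list(aligned_seq)
    for pos in sites:
        if 1 <= pos <= len(chars):
            chars[pos - 1] = mask_char
    return "".join(chars)
```

Keeping the "strong signal" category behind a flag makes it easy to build trees with and without those three sites and compare polytomy sizes and run times.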

They propose some further checks which will be harder to incorporate into our analysis. They also "use a custom script to remove all sequences that are at least three substitutions away from any other sequence" -- this would be interesting to flag up and see how often it happens.
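
As a rough illustration of the "three substitutions away" check, the sketch below flags sequences whose nearest neighbour is at least three substitutions distant. This is not De Maio et al's custom script; the names are hypothetical, it assumes equal-length, upper-case aligned sequences, it ignores positions where either sequence has `N` or a gap, and the all-against-all comparison is quadratic, so it would only be practical on a subsample.

```python
AMBIGUOUS = set("N-")  # characters ignored when counting substitutions


def substitution_distance(a, b):
    """Count differing positions between two aligned sequences,
    skipping positions where either one is N or a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for x, y in zip(a, b)
        if x != y and x not in AMBIGUOUS and y not in AMBIGUOUS
    )


def flag_isolated(sequences, min_distance=3):
    """Return names of sequences whose closest other sequence is at
    least `min_distance` substitutions away.

    `sequences` maps strain name -> aligned sequence string.
    """
    flagged = []
    for name, seq in sequences.items():
        nearest = min(
            (
                substitution_distance(seq, other)
                for other_name, other in sequences.items()
                if other_name != name
            ),
            default=None,  # only one sequence: nothing to compare against
        )
        if nearest is not None and nearest >= min_distance:
            flagged.append(name)
    return flagged
```

Reporting the flagged names rather than dropping them would let us see how often this happens before deciding whether to filter.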

jameshadfield avatar May 06 '20 04:05 jameshadfield

We now mask "mask_sites: 13402, 24389, 24390".

rneher avatar May 23 '20 16:05 rneher

Any plans to revisit masking? De Maio et al's collection of Problematic Sites has grown quite a bit since initial publication. It even includes some sites contributed by my colleagues @russcd, @yatisht, @lgozasht and Bryan Thornlow, from analyzing Nextstrain's trees and variants to identify mutations with high parsimony scores and incorporating metadata to find lab-associated variants. The take-home (https://www.biorxiv.org/content/10.1101/2020.06.08.141127v1, I'm a coauthor so shameless self-promotion alert I guess) is that while the problematic sites generally don't disrupt major branches of the tree, they can cause problems closer to the leaves, and this could interfere with contact tracing efforts.

(Virological update about addition of sites identified by UCSC and others: https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473/12)

AngieHinrichs avatar Aug 26 '20 23:08 AngieHinrichs

Thanks Angie! We should indeed revisit these mask sites. They are still currently just 13402, 24389, 24390: https://github.com/nextstrain/ncov/blob/master/defaults/parameters.yaml#L75. The De Maio collection is worth investigating. However, there is increasing evidence of true convergence due to mutational pressure as well as selective pressure.

For seasonal flu we had a strategy where particular sites are masked while tree building but they still are used in ancestral state reconstruction. Something similar may well be warranted here.

trvrb avatar Jan 21 '21 21:01 trvrb

Adding a quick clarification. The suggested masking vcf that we provide does include a few sites that are highly homoplastic, but probably real (e.g., 11083). However, the vast majority of positions in our masking recommendations are both highly homoplastic and associated with one or a small set of sequencing groups. In the vcf file, we include a code indicating the reasons why each site was masked. So, it might work to mask only sites with tags "single_src" or "narrow_src" to remove the likely artifactual variants. I am also happy to provide additional clarification if helpful.
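
A minimal sketch of the tag-based selection suggested above, for anyone who wants to try it: it pulls out positions whose INFO column carries "single_src" or "narrow_src". The function name is hypothetical, and the sketch assumes only that the file is tab-separated VCF with the reason codes appearing as comma-separated values somewhere in the INFO column (column 8); the exact INFO key should be checked against the real file.

```python
ARTIFACT_TAGS = {"single_src", "narrow_src"}


def artifact_sites(vcf_lines, tags=ARTIFACT_TAGS):
    """Return 1-based positions whose INFO column contains any of `tags`.

    `vcf_lines` is any iterable of lines, e.g. an open file handle.
    """
    sites = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF headers
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        pos, info = int(fields[1]), fields[7]
        # Break "KEY=v1,v2;FLAG;KEY2=v3" into individual value tokens.
        tokens = set()
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            tokens.update((value if sep else key).split(","))
        if tokens & set(tags):
            sites.append(pos)
    return sites
```

The resulting positions could then be supplied as the mask sites in place of the full recommendation, leaving sites that are homoplastic but probably real (e.g., 11083) unmasked.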

-- Russ Corbett-Detig Assistant Professor Department of Biomolecular Engineering University of California, Santa Cruz

russcd avatar Jan 27 '21 23:01 russcd