outbreak.info

Cumulative prevalence calculations are very brittle to sequences with incorrect year

Open sacundim opened this issue 3 years ago • 1 comments

Outbreak.info uses, in many places, a cumulative prevalence metric defined as follows (text from a footnote in the Location Tracker):

Apparent cumulative prevalence is the ratio of the sequences containing the lineage or mutation(s) to all sequences collected since the identification of lineage or mutation(s) in that location.

I was puzzled for a bit when I saw that the cumulative prevalence reported for Omicron in Puerto Rico was only 20%, since nearly every one of the hundreds of sequences collected there since it was first detected in late November has been Omicron. But then I spotted the cause: misdated sequences whose collection dates say January 2021 but that are really from January 2022.
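A minimal sketch of how this can happen, assuming the metric is computed literally as the footnote describes it (this is not Outbreak.info's actual code, and the function name and counts below are made up for illustration):

```python
from datetime import date


def cumulative_prevalence(records, lineage):
    """Share of `lineage` among all sequences collected on or after the
    earliest collection date recorded for `lineage`.

    `records` is a list of (collection_date, lineage) tuples.
    """
    lineage_dates = [d for d, name in records if name == lineage]
    if not lineage_dates:
        return 0.0
    first_seen = min(lineage_dates)  # a single misdated record moves this
    window = [name for d, name in records if d >= first_seen]
    return window.count(lineage) / len(window)


# Made-up counts for one location (not real Puerto Rico data).
records = (
    [(date(2021, 6, 1), "Delta")] * 400      # collected before Omicron arrived
    + [(date(2021, 11, 28), "Omicron")]      # true first detection
    + [(date(2021, 12, 5), "Delta")] * 5
    + [(date(2022, 1, 10), "Omicron")] * 99
)

clean = cumulative_prevalence(records, "Omicron")    # 100 / 105, about 0.95

# One January 2022 sample submitted with the year typed as 2021.
misdated = records + [(date(2021, 1, 10), "Omicron")]
skewed = cumulative_prevalence(misdated, "Omicron")  # 101 / 506, about 0.20

print(f"clean: {clean:.2f}, with one misdated record: {skewed:.2f}")
```

Because the denominator starts at the earliest date recorded for the lineage, one bad date pulls in every sequence collected during the preceding year.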

Screen Shot 2022-02-03 at 12 14 20 PM

I am able to see those sequences in GenBank as well:

  • https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=Severe%20acute%20respiratory%20syndrome%20coronavirus%202%20(SARS-CoV-2),%20taxid:2697049&USAState_s=PR&Lineage_s=BA.1.1&Lineage_s=BA.1&CollectionDate_dr=2021-01-01T00:00:00.00Z%20TO%202021-01-31T23:59:59.00Z
Screen Shot 2022-02-03 at 12 18 28 PM

Now, obviously, Outbreak.info cannot correct the dates on sequences submitted to the databases. My concern is the one in the title: the cumulative prevalence calculation is extremely brittle to this sort of misdating, which is a very common problem. My GenBank search finds 19 such sequences in this case, but it likely takes only a single misdated sequence to throw off the prevalence calculation, possibly by an order of magnitude. As a result, I find I can't trust, for example, these tables in the Mutation Tracker:

Screen Shot 2022-02-03 at 12 25 57 PM

There's got to be some sort of filter that can be applied in these calculations to produce more sensible results. One idea, for example, would be to curate a list of the earliest "sensible" dates for each of the major variants and to exclude from the calculation any sequences with sample dates earlier than that.
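A rough sketch of that filter, assuming records are (collection date, lineage) pairs. The `EARLIEST_SENSIBLE` table, its cutoff dates, and the `drop_implausible` helper are hypothetical placeholders, not curated values or existing Outbreak.info code:

```python
from datetime import date

# Illustrative cutoffs only; a real list would need to be curated.
EARLIEST_SENSIBLE = {
    "BA.1": date(2021, 11, 1),
    "BA.1.1": date(2021, 11, 1),
}


def drop_implausible(records, earliest=EARLIEST_SENSIBLE):
    """Keep (collection_date, lineage) records unless the date precedes the
    curated cutoff for that lineage. Lineages without a cutoff pass through."""
    kept = []
    for collected, lineage in records:
        cutoff = earliest.get(lineage)
        if cutoff is not None and collected < cutoff:
            continue  # dated before the lineage plausibly existed
        kept.append((collected, lineage))
    return kept


sample = [
    (date(2021, 1, 10), "BA.1"),     # misdated: predates the cutoff
    (date(2021, 1, 10), "B.1.1.7"),  # no cutoff listed, kept
    (date(2022, 1, 10), "BA.1.1"),   # plausible, kept
]
filtered = drop_implausible(sample)
print(filtered)
```

Running the prevalence calculation on the filtered records would keep a misdated sequence from resetting the "first identified" date, at the cost of maintaining the cutoff list by hand.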

sacundim · Feb 03 '22 20:02