
Output cases per day, deme, and variant

Open huddlej opened this issue 3 years ago • 2 comments

Description

Building on the work in issue #22, output the number of cases per day, deme, and variant to support models like @marlinfiggins's Rt frequency dynamics models.

Example output looks like:

date	location	variant	sequences
2021-01-02	Alabama	other	3
2021-01-03	Alabama	other	3
2021-01-04	Alabama	other	12
2021-01-05	Alabama	other	73
2021-01-06	Alabama	other	36

See recent variant counts for the USA for a complete example.
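As a rough sketch of the aggregation step, the snippet below counts per-case records by day, deme, and variant and writes them in the tab-separated layout shown above. The record shape, the deme and variant labels, and the function names `count_cases` and `write_counts` are assumptions made for illustration; they are not part of antigen's current output.

```python
import csv
import io
from collections import Counter


def count_cases(cases):
    """Count cases per (date, location, variant).

    `cases` is an iterable of (date, location, variant) tuples, one per case.
    Returns a list of ((date, location, variant), count) pairs sorted by
    location, then variant, then date.
    """
    counts = Counter(cases)
    return sorted(
        counts.items(),
        key=lambda item: (item[0][1], item[0][2], item[0][0]),
    )


def write_counts(rows, handle):
    """Write counted rows as a tab-separated table with a header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["date", "location", "variant", "sequences"])
    for (date, location, variant), n in rows:
        writer.writerow([date, location, variant, n])


# Hypothetical per-case records: one tuple per simulated case.
cases = [
    ("2021-01-02", "north", "A"),
    ("2021-01-02", "north", "A"),
    ("2021-01-02", "north", "B"),
    ("2021-01-03", "north", "A"),
    ("2021-01-02", "tropics", "A"),
]

buffer = io.StringIO()
write_counts(count_cases(cases), buffer)
print(buffer.getvalue(), end="")
```

Whether this counting happens inside the simulator or in a post-processing script is an open design choice; the sketch only fixes the shape of the output.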

Possible solution

For SARS-CoV-2, "variants" are already well defined as phylogenetic lineages of interest. The closest analog in antigen would be a specific phenotype or a cluster of phenotypes in antigenic space. In @trvrb's original paper, he clustered phenotypes in 2D space as shown below in the bottom right panel:

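[image: figure from the original paper; the bottom right panel shows phenotypes clustered in 2D antigenic space]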

To support this output, we may need to implement similar clustering logic that will group phenotypes into consistent lineages through time. Alternatively, we could output cases per specific phenotype (potentially generating hundreds of different "variants").
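One way to get cluster labels that stay consistent through time is to pool phenotype coordinates from all time points and cluster them once, so each phenotype keeps the same label wherever it appears. The sketch below uses a small k-means written in plain Python purely as a stand-in: the choice of k-means, the number of clusters, the farthest-first initialization, and the example coordinates are all assumptions, and nothing here is taken from the clustering used in the original paper. A library implementation would be the practical choice for real output.

```python
import math


def farthest_first_centers(points, k):
    """Pick k starting centers: the first point, then repeatedly the point
    farthest from the centers chosen so far (deterministic)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, max_iterations=100):
    """Cluster 2D points into k groups; returns (labels, centers)."""
    centers = farthest_first_centers(points, k)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical phenotype coordinates in 2D antigenic space, pooled over time.
phenotypes = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
]
labels, centers = kmeans(phenotypes, k=2)
variants = [f"cluster_{label}" for label in labels]
print(variants)
# ['cluster_0', 'cluster_0', 'cluster_0', 'cluster_1', 'cluster_1', 'cluster_1']
```

The resulting `variants` labels could then stand in for the `variant` column when counting cases.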

We might implement this output as part of the same "case counts" output mentioned in #22 or as a separate file. We might also consider whether we want to parameterize how these variants are sampled to recreate the sampling bias present in real data where not all cases can be sequenced.
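The sampling idea could be parameterized as a per-deme sequencing probability, with each case independently "sequenced" at that rate (binomial thinning). This is only a sketch under that assumption: the function name `subsample_counts`, the `sequencing_rate` mapping, and the example rates are hypothetical, and real sequencing bias may also vary over time or by variant.

```python
import random


def subsample_counts(rows, sequencing_rate, seed=0):
    """Thin case counts to mimic incomplete sequencing.

    `rows` holds (date, location, variant, cases) tuples.
    `sequencing_rate` maps each location to the probability that a single
    case is sequenced. Each case is kept independently with that probability.
    """
    rng = random.Random(seed)  # Seeded so repeated runs give the same sample.
    sampled = []
    for date, location, variant, cases in rows:
        rate = sequencing_rate[location]
        sequenced = sum(rng.random() < rate for _ in range(cases))
        sampled.append((date, location, variant, sequenced))
    return sampled


# Hypothetical case counts and per-deme sequencing rates.
rows = [
    ("2021-01-02", "north", "A", 120),
    ("2021-01-02", "tropics", "A", 80),
    ("2021-01-02", "south", "B", 40),
]
rates = {"north": 0.25, "tropics": 0.05, "south": 1.0}

for row in subsample_counts(rows, rates):
    print("\t".join(str(field) for field in row))
```

Keeping the sampling as a separate step from the case counts would leave the unbiased counts available for comparison.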

huddlej avatar Jan 19 '22 23:01 huddlej

@huddlej were you thinking that the clustering would happen inside Java? If we were to do that we would probably want to bring in some library for that.

An alternative, as you say, would be to output all of the phenotypic information and then do clustering using scikit-learn in Python. At least perhaps that's the right first step?

matsen avatar Jan 20 '22 00:01 matsen

@matsen That's a good point to clarify! It looks like Trevor originally applied clustering in a Mathematica notebook, so I think a scikit-learn approach would be a good first step for this issue. The Mathematica notebook could provide some direction about which data frames from antigen Trevor used for that clustering analysis.

huddlej avatar Jan 20 '22 00:01 huddlej