pplacer
feature suggestion: weights for reference sequences
This is a half-baked thought at this point, but it has occurred to me that a lot of information is lost when selecting representative reference sequences for a reference package. Consider a species whose observed diversity consists of many identical or very closely related reference sequences plus a small number of more divergent ones. We would most likely select only one representative of the most prevalent variant, so in the reference package it counts no more than any divergent sequence that is also included, and pplacer has no way to know which of the reference sequences are more "authoritative" when performing classification.

I wonder whether the prevalence of each reference sequence among all candidate reference sequences could be represented as a weight, and whether taxonomic assignment could be informed by these weights. Whether it would matter is of course another question, but I could imagine it helping to mitigate classification artifacts caused by including "outlier" reference sequences in the reference package.
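To make the idea concrete, here is a rough Python sketch of what I have in mind. None of this is existing pplacer or guppy functionality: the function names, the input shapes, and the choice of Hamming distance on pre-aligned sequences are all assumptions made purely for illustration. Each candidate sequence is assigned to its nearest selected representative, the weight of a representative is the fraction of candidates it stands for, and a per-placement likelihood weight is then scaled by that prevalence before summing per taxon.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatched positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def prevalence_weights(candidates, representatives):
    """Hypothetical helper: fraction of candidates nearest to each representative.

    candidates: list of aligned sequences (all candidate reference sequences)
    representatives: dict of rep_id -> aligned sequence (those kept in the package)
    Ties go to the representative that comes first in dict order.
    """
    if not candidates:
        raise ValueError("need at least one candidate sequence")
    counts = {rep_id: 0 for rep_id in representatives}
    for seq in candidates:
        nearest = min(representatives,
                      key=lambda r: hamming(seq, representatives[r]))
        counts[nearest] += 1
    return {rep_id: n / len(candidates) for rep_id, n in counts.items()}


def weighted_taxon_scores(placements, taxon_of, weights):
    """Hypothetical helper: scale each placement's likelihood weight by the
    prevalence of the reference it landed on, then normalize per taxon.

    placements: list of (rep_id, likelihood_weight) pairs for one query
    taxon_of: dict of rep_id -> taxon name
    """
    scores = defaultdict(float)
    for rep_id, like_weight in placements:
        scores[taxon_of[rep_id]] += like_weight * weights[rep_id]
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()} if total else {}


# Toy data: nine candidates resemble the common variant, one is an outlier.
candidates = ["ACGTAC"] * 8 + ["ACGTAA", "TTGTAA"]
representatives = {"common": "ACGTAC", "outlier": "TTGTAA"}
weights = prevalence_weights(candidates, representatives)

# A query that is placed equally well on both references.
placements = [("common", 0.5), ("outlier", 0.5)]
taxon_of = {"common": "species_A", "outlier": "species_B"}
scores = weighted_taxon_scores(placements, taxon_of, weights)
print(weights)
print(scores)
```

In this toy case the two references would tie without weights, while the prevalence weights push the assignment towards the taxon backed by most of the observed sequences. Multiplying the two weights is just one possible way to combine them; whether that is statistically sensible is exactly the open question.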