nextclade feat: add coverage qc metric

feat: add coverage qc metric

Open ivan-aksamentov opened this issue 1 year ago • 3 comments

Resolves #932 Related https://github.com/cevo-public/cov-spectrum-website/issues/545

In this PR:

[x] calculate coverage QC metric as defined in the issues linked above
[x] add it to CSV and TSV output
[x] add it to web UI - a circle with letter "c" in the "QC" column

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Updated
nextclade	✅ Ready (Inspect)	Visit Preview	Aug 30, 2022 at 1:07AM (UTC)

Aug 09 '22 03:08 vercel[bot]

We may or may not keep it is as a QC metric, depending on whether we want to signal low coverage as something bad or not.

If we don't want that, then we could just remove all the QC code here and move the coverage field into the main struct (which is reflected in JSON and NDJSON outputs) and rename the CSV column to simply coverage. In this case we need to decide where it should be displayed in the web UI.

Aug 09 '22 03:08 ivan-aksamentov

Fantastic Ivan, I will give this a deep check today. You're a 🌟

Aug 09 '22 08:08 corneliusroemer

On way to surface coverage would be in the missing column - and/or the tooltip. After all it's related to missing, it's missing inside + missing at terminals.

Aug 18 '22 12:08 corneliusroemer

@corneliusroemer

On way to surface coverage would be in the missing column - and/or the tooltip

Are you saying that it's better to NOT make it a new QC metric and treat it as a simple number, like many other numbers in other columns?

This is important question in my view. Do we want to signal low coverage as sequencing fault or not? For example Spike-only sequences will be in deep red QC territory.

Please ask Richard if unsure.

Aug 18 '22 12:08 ivan-aksamentov

I think it would be good to flag this as a moderate QC issue - but not cause a sequence to be red. That's why I suggested a max QC score above. But yeah, we should discuss with Richard. We can have both: treat it as a number in column and QC - the two are kind of separate (we treat missing both as QC issue and as a number to be shown).

Slight number formatting issue but this can be fixed later once we've decided how to treat coverage

Aug 18 '22 12:08 corneliusroemer

regarding the formula, I would simply report "what fraction of the reference do we have information for". And that would be

let total_covered_nucs = total_aligned_nucs - total_missing - total_non_acgtns;

assuming that all ambiguous nucleotides are in the aligned bits. That is almost always true, but not guaranteed.

Aug 18 '22 13:08 rneher

regarding how to surface: People have been confused that short sequences are labeled as shiny high quality ones. But it is not a quality issue per se. A little torn on this one. I'd rather not have it be a QC issue.

Aug 18 '22 13:08 rneher

regarding how to surface: People have been confused that short sequences are labeled as shiny high quality ones. But it is not a quality issue per se. A little torn on this one. I'd rather not have it be a QC issue.

A compromise could be to have it show up as an orange badge but not contribute to the overall score. Then it would be easy to spot incomplete sequences quickly without having to parse numbers.

But in any case, if it's a column of its own people should be able to do their own filtering.

Aug 18 '22 14:08 corneliusroemer

So if we don't put it as a QC metric, we'd have to change the code a bit, right? @ivan-aksamentov

Surface it as a column of its own, just as a float.

Aug 18 '22 16:08 corneliusroemer

@corneliusroemer @rneher

I removed the QC metric and added coverage as a column.

Please:

check that all the outputs make sense (JSON, NDJSON, CSV, TSV, Web).
write tootltip text for the column:

https://github.com/nextstrain/nextclade/blob/ff9767b42f31bab460e686f746e15c9679603b5c/packages_rs/nextclade-web/src/components/Results/HelpTips/HelpTipsColumnCoverage.mdx?plain=1#L1-L3

Aug 18 '22 23:08 ivan-aksamentov

As an optional change, I also added new fields to output files

refLength
totalCoveredNucs

because lack of bits for custom calculations was also mentioned in the original issue. They are byproducts of calculation of coverage, so why not use them. I think some users may enjoy having ref length without needing to have the entire ref sequence around or hardcoding the length. And totalCoveredNucs might be useful if only one gene is sequenced - then they can calculate their own coverage for that one gene, knowing its length.

On the negative side: it's not that we don't have enough columns already. The CSV/TSV files are already humongous. Ref length is the same for all sequences and just duplicated.

Let me know if you want to keep this.

Aug 19 '22 00:08 ivan-aksamentov

Thanks that's great. I wouldn't yet include ref length and covered nucs - these are pretty easily derived as is.

We can always add them later but removing columns is worse.

Aug 27 '22 10:08 corneliusroemer

@ivan-aksamentov I'd get rid of the covered-nucs and ref-length columns for now and reduce the number of significant figures to 5 for the float of coverage. Also if we could move the column on the right of Ns that'd be great.

Otherwise ready to merge! Should be useful.

@chaoran-chen this should do the job, right? You get a number between 0 and 1 that says what proportion of the sequence we have information on.

Aug 29 '22 18:08 corneliusroemer

nextclade nextclade copied to clipboard

feat: add coverage qc metric

nextclade
nextclade copied to clipboard