nextclade ENH: Surface total number of unsequenced nucleotides (including tails) in parens

Open corneliusroemer opened this issue 2 years ago • 4 comments

I was investigating what had happened in the newly designated BA.2.3.3, which appeared to have reversions at N:203/204. It turned out this is an artefact: the 3' end containing these positions is not sequenced in most representatives of this "lineage".

Now, when one looks at the missing count displayed by Nextclade, it's not evident that the ends are unsequenced (around 1k nucs). It would be nice if we could surface trailing unsequenced regions.

I know the reason we're not including these at the moment is that we don't want to penalize partial sequences - if that is intended by the sequencer (e.g. Sanger sequenced stuff, or just the Spike for SC2).

However, it would be good if we could surface the information available in alignmentStart and alignmentEnd so that it's easier to spot this sort of missing tail in case it is not intended, as is usually the case in SC2 sequencing.

I suggest we add the total number of unsequenced or missing nucleotides in parentheses in the table - similar to how we surface total number of frameshifts (including known ones) in parens.
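To make the proposal concrete, here is a minimal sketch of how such a cell could be rendered. The function name `format_missing_cell` and its parameters are hypothetical and are not part of Nextclade; the sketch only assumes that the internal missing count and the two tail lengths are already known.

```python
def format_missing_cell(internal_missing: int, leading_tail: int, trailing_tail: int) -> str:
    """Render a 'missing' table cell as 'internal (total)'.

    Hypothetical helper, not Nextclade code: the main number stays the
    internal missing count, and the total including unsequenced tails
    is appended in parentheses.
    """
    total = internal_missing + leading_tail + trailing_tail
    # Leave the cell unchanged when there are no unsequenced tails,
    # so fully sequenced genomes look exactly as they do today.
    if total == internal_missing:
        return str(internal_missing)
    return f"{internal_missing} ({total})"


print(format_missing_cell(120, 54, 1000))  # 120 (1174)
print(format_missing_cell(120, 0, 0))      # 120
```

Keeping the main number unchanged means partial sequences are not penalized, while the parenthetical makes large tails visible at a glance.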

Also, it'd be nice if we showed alignmentStart and alignmentEnd in the tooltips:

I suggest we use two headings, one for internally missing, and one for missing tails, with similar design as for frameshifts here:

Apr 23 '22 13:04 corneliusroemer

@corneliusroemer

alignmentStart and alignmentEnd are shown in the tooltip of the "Sequence name" column as the "Alignment" range. They are also shown as grey areas in nuc sequence views. If they are not, it's a bug.

Again (see #757, #738), there is no clarity as to whether unsequenced 3' and 5' regions are missing or gaps, so I am not convinced it should go to the "missing" necessarily.

Historically (simpler implementation) Nextclade treats as missing only what explicitly comes as nuc N. The unsequenced 3' and 5' regions are not Ns.
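The distinction can be illustrated with a small sketch that counts explicit N characters separately from terminal padding. It assumes, purely for illustration, that unsequenced ends appear as runs of `-` at either end of an aligned sequence string; the function name is hypothetical and this is not how Nextclade computes these values internally.

```python
def count_missing_and_tails(aligned: str) -> tuple[int, int, int]:
    """Return (n_count, leading_gaps, trailing_gaps) for an aligned sequence.

    Assumption: unsequenced 5' and 3' regions are padded with '-' in the
    aligned string, while explicitly missing bases are written as 'N'.
    """
    n_count = aligned.upper().count("N")
    leading = len(aligned) - len(aligned.lstrip("-"))
    trailing = len(aligned) - len(aligned.rstrip("-"))
    # A string made only of gaps would otherwise be counted twice.
    if leading == len(aligned):
        trailing = 0
    return n_count, leading, trailing


# Two Ns inside the sequence, two padded positions at the start, three at the end.
print(count_missing_and_tails("--ACGNNT-A---"))  # (2, 2, 3)
```

Note that the internal `-` in the example is a deletion, not a tail, and is deliberately not counted.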

I don't have a strong opinion on the matter, other than that bioinformaticians should formalize their field and terminology a little, so that things are less wobbly.

Please discuss this with Richard and the community. I am happy to improve Nextclade if there's a consensus.

Apr 25 '22 10:04 ivan-aksamentov

You're right, they are surfaced in the sequence name tooltip - that's better than nothing but not really that obvious and easy to see. One shouldn't need to look at a tooltip to quickly see that large amounts are missing from beginning or end.

I understand the differentiation between internal and external missing nucs. That's why I suggested to add external nucs in parentheses - and leave the main missing number unchanged.

Right now, it's just not so easy to see at a glance that sequences are missing 1000 nucs from the end. If one sees that the alignment end is 29500, one needs to do mental arithmetic and know the ref length of SC2 to realize it's 500 nucs missing from the end.
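The mental arithmetic in question is simple to automate. The sketch below is a hypothetical helper, not Nextclade code, and it assumes a 0-based alignment start and an exclusive alignment end; the coordinate convention of real output should be checked before using it. With the Wuhan-Hu-1 reference (29,903 nt), an alignment end of 29500 leaves roughly 400 nucleotides unsequenced at the 3' end.

```python
SC2_REF_LENGTH = 29903  # length of the Wuhan-Hu-1 reference (MN908947.3)


def unsequenced_tails(alignment_start: int, alignment_end: int, ref_length: int) -> tuple[int, int]:
    """Return (leading, trailing) unsequenced lengths.

    Assumption: alignment_start is 0-based and alignment_end is exclusive,
    so a full-length alignment is (0, ref_length).
    """
    if not 0 <= alignment_start <= alignment_end <= ref_length:
        raise ValueError("alignment range must lie within the reference")
    return alignment_start, ref_length - alignment_end


leading, trailing = unsequenced_tails(0, 29500, SC2_REF_LENGTH)
print(leading, trailing)  # 0 403
```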

Apr 25 '22 10:04 corneliusroemer

@corneliusroemer

"One shouldn't need to look at a tooltip to quickly see that large amounts are missing from beginning or end."

Are they not displayed as gray areas on both ends? Or is it too small to notice?

You mean we need to add how many nucs there are in these regions. Now I understand. That should be easy to add in the places where the ranges are already displayed (in the seq name column and in the sequence view marker tooltips).

Apr 25 '22 11:04 ivan-aksamentov

"Are they not displayed as gray areas on both ends? Or is it too small to notice?"

Yes they are, that's how I noticed in the case described above, but only after a while. One needs to change to nuc view and then be attentive.

The discoverability/visibility would be much better if it was displayed straight in the main table. Hence my parenthesis proposal.

Because these are missing in some sense - though sometimes on purpose - they seem appropriate in the missing column. The seqname column tooltip contains all sorts of miscellaneous information at the moment. I never hover over it in practice.

Apr 25 '22 11:04 corneliusroemer
