ENH: Surface total number of unsequenced nucleotides (including tails) in parens
I was investigating what had happened in the newly designated BA.2.3.3, which appeared to have reversions at N:203/204. It turned out this is an artefact, because the 3' end containing these positions is not sequenced in most representatives of this "lineage".
Now, when one looks at the missing count displayed by Nextclade, it's not evident that the ends are unsequenced (around 1k nucs). It would be nice if we could surface trailing unsequenced regions.
I know the reason we're not including these at the moment is that we don't want to penalize partial sequences - if that is intended by the sequencer (e.g. Sanger sequenced stuff, or just the Spike for SC2).
However, it would be good if we could surface the information available in `alignmentStart` and `alignmentEnd` so that it's easier to spot this sort of missing tail when it is not intended, as is usually the case in SC2 sequencing.
I suggest we add the total number of unsequenced or missing nucleotides in parentheses in the table - similar to how we surface total number of frameshifts (including known ones) in parens.
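A minimal sketch of how such a table cell could be rendered, assuming the internal missing count and the two unsequenced tail lengths are already known. The function name `format_missing_cell` and its parameters are hypothetical and not part of Nextclade:

```python
def format_missing_cell(internal_missing: int, tail_5p: int, tail_3p: int) -> str:
    """Render the 'missing' cell as 'internal (total including tails)'.

    Hypothetical helper: the parenthesised total is shown only when
    unsequenced tails exist, so the main number stays unchanged.
    """
    total = internal_missing + tail_5p + tail_3p
    if total == internal_missing:
        # No unsequenced tails: show the plain count, as today.
        return str(internal_missing)
    return f"{internal_missing} ({total})"


# 12 internal Ns plus a 1003-nucleotide unsequenced 3' tail.
print(format_missing_cell(12, 0, 1003))  # 12 (1015)
```

Keeping the internal count as the leading number means partial sequences are not penalized, while the parenthesised total still makes a long missing tail visible at a glance.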
Also, it'd be nice if we showed `alignmentStart` and `alignmentEnd` in the tooltips:
I suggest we use two headings, one for internally missing, and one for missing tails, with similar design as for frameshifts here:
@corneliusroemer
`alignmentStart` and `alignmentEnd` are shown in the tooltip of the "Sequence name" column as the "Alignment" range. They are also shown as grey areas in the nuc sequence views. If they are not, it's a bug.
Again (see #757, #738), there is no clarity as to whether unsequenced 3' and 5' regions are missing or gaps, so I am not convinced they necessarily belong under "missing".
Historically (simpler implementation) Nextclade treats as missing only what explicitly comes as nuc `N`. The unsequenced 3' and 5' regions are not `N`s.
I don't have a strong opinion on the matter, other than that bioinformaticians should formalize their field and terminology a little, so that things are less wobbly.
Please discuss this with Richard and the community. I am happy to improve Nextclade if there's a consensus.
You're right, they are surfaced in the sequence name tooltip - that's better than nothing but not really that obvious and easy to see. One shouldn't need to look at a tooltip to quickly see that large amounts are missing from beginning or end.
I understand the differentiation between internal and external missing nucs. That's why I suggested to add external nucs in parentheses - and leave the main missing number unchanged.
Right now, it's just not so easy to see at a glance that sequences are missing 1000 nucs from the end. If one sees that the alignment end is 29500, one needs to do mental arithmetic and know the ref length of SC2 (29903) to realize that some 400 nucs are missing from the end.
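The arithmetic in question can be sketched as follows. This assumes `alignmentStart` and `alignmentEnd` form a 0-based, half-open range on the reference and uses the 29903-nucleotide Wuhan-Hu-1 reference; the function name `unsequenced_tails` is hypothetical, and the indexing convention should be checked against the actual Nextclade output:

```python
SC2_REF_LENGTH = 29903  # length of the Wuhan-Hu-1 reference (MN908947)


def unsequenced_tails(alignment_start: int, alignment_end: int,
                      ref_length: int = SC2_REF_LENGTH) -> tuple[int, int]:
    """Return (5' tail length, 3' tail length) not covered by the alignment.

    Assumes a 0-based, half-open [alignment_start, alignment_end) range.
    """
    if not 0 <= alignment_start <= alignment_end <= ref_length:
        raise ValueError("alignment range must lie within the reference")
    return alignment_start, ref_length - alignment_end


tail_5p, tail_3p = unsequenced_tails(0, 29500)
print(tail_5p, tail_3p)  # 0 403
```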
@corneliusroemer
One shouldn't need to look at a tooltip to quickly see that large amounts are missing from beginning or end.
Are they not displayed as gray areas on both ends? Or is it too small to notice?
You mean we need to add how many nucs there are in these regions. Now I understand. Should be easy to add in places where the ranges are already displayed (in seq name column and in sequence view marker tooltips).
Are they not displayed as gray areas on both ends? Or is it too small to notice?
Yes they are, that's how I noticed in the case described above, but only after a while. One needs to change to nuc view and then be attentive.
The discoverability/visibility would be much better if it was displayed straight in the main table. Hence my parenthesis proposal.
Because these are missing in some sense, though sometimes on purpose, they seem appropriate in the `missing` column. The seqname column tooltip contains all sorts of miscellaneous information at the moment. I never hover over it in practice.