sc2rf
sc2rf copied to clipboard
Terminal Ns not recognized as missing
While investigating https://github.com/cov-lineages/pango-designation/issues/590, I noticed that samples with the BA.2 S2M deletion (29734:29759) were being incorrectly visualized as having reference bases in sc2rf:
Consensus View:
sc2rf View:
I think this could be for a couple of reasons:
-
When
--enable-deletions
is used, perhaps deletions should not be considered missing data?missings_matches = ["N"] if not args.enable_deletions: missings_matches.append("-")
-
I think there is missing logic when detecting a run of Ns, to catch if that runs proceeds to the end of the genome?
if s in missings_matches: # we've been tracking a run of N's, this base marks the end if start_n == -1: start_n = i # mark the start of possible run of N's elif start_n >= 0: missings.append((start_n, i-1)) # Python-style (closed, open) interval start_n = -1 # Missing logic to catch missing data at the end of the genome? if i == len(reference) and s in missings_matches: missings.append((start_n, i-1))
With these changes, the sc2rf output more closely matches the consensus sequence/my expectation:
I think this is a bug, but if it's the intended behaviour for deletions, please let me know. Thanks!
Sorry, the code for both of these is:
python3 sc2rf.py pango_designation_590.fasta --ansi --unique 1 --enable-deletions