sc2rf icon indicating copy to clipboard operation
sc2rf copied to clipboard

Terminal Ns not recognized as missing

Open ktmeaton opened this issue 2 years ago • 1 comments

While investigating https://github.com/cov-lineages/pango-designation/issues/590, I noticed that samples with the BA.2 S2M deletion (29734:29759) were being incorrectly visualized as having reference bases in sc2rf:

Consensus View: image

sc2rf View: image

I think this could be for a couple of reasons:

  1. When --enable-deletions is used, perhaps deletions should not be considered missing data?

    missings_matches = ["N"]
    if not args.enable_deletions:
        missings_matches.append("-")
    
  2. I think there is missing logic when detecting a run of Ns, to catch if that runs proceeds to the end of the genome?

    if s in missings_matches:
        # we've been tracking a run of N's, this base marks the end              
        if start_n == -1:
            start_n = i  # mark the start of possible run of N's
    elif start_n >= 0:
        missings.append((start_n, i-1))  # Python-style (closed, open) interval
        start_n = -1
    
    # Missing logic to catch missing data at the end of the genome?
    if i == len(reference) and s in missings_matches:
        missings.append((start_n, i-1))
    

With these changes, the sc2rf output more closely matches the consensus sequence/my expectation:

image

I think this is a bug, but if it's the intended behaviour for deletions, please let me know. Thanks!

ktmeaton avatar May 06 '22 18:05 ktmeaton

Sorry, the code for both of these is:

python3 sc2rf.py pango_designation_590.fasta --ansi --unique 1 --enable-deletions

ktmeaton avatar May 06 '22 18:05 ktmeaton