poreCov

Frameshift correction

Open hoelzer opened this issue 2 years ago • 0 comments

Frameshifts (FSs) are introduced into consensus sequences quite frequently. In almost all cases these are errors.

Suggestion:

We could integrate a new tool, proovframe, to correct frameshifts by aligning reference protein sequences to the consensus sequences.

  • https://github.com/thackl/proovframe

So far I have only tried this with a single example sequence, so it still needs proper benchmarking:

  • Top: original sequence with a frameshift from poreCov.
  • Middle: sequence after proovframe correction with all SARS-CoV-2 proteins as reference. However, this introduces another error in ORF1a, likely due to the polyprotein structure of ORF1ab.
  • Bottom: after removing the ORF1ab polyprotein sequence from the reference FASTA, the correction seems to work and the sequence is fixed.

image

Reference protein FASTA used w/o the ORF1ab polyprotein: GCF_009858895.2_ASM985889v3_protein_noORF1ab.faa.zip

Commands:

# map reference proteins to the consensus sequence
proovframe/bin/proovframe map -a GCF_009858895.2_ASM985889v3_protein_noORF1ab.faa -o raw-seqs.tsv sample.consensus.fasta

# fix frameshifts in the consensus sequence
proovframe/bin/proovframe fix -o corrected.fasta sample.consensus.fasta raw-seqs.tsv
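The two steps above could be wrapped in a small shell function so that the corrected consensus is written next to the untouched original. This is only a sketch: the function name `fs_correct`, the `PROOVFRAME` variable, and the output naming scheme are assumptions for illustration and are not part of poreCov or proovframe.

```shell
# Hypothetical wrapper (not part of poreCov): runs the two proovframe steps
# shown above and writes the corrected consensus next to the original file.
# PROOVFRAME defaults to the path used in the commands above.
PROOVFRAME="${PROOVFRAME:-proovframe/bin/proovframe}"

fs_correct() {
    local proteins="$1"                  # reference protein FASTA (here: without ORF1ab)
    local consensus="$2"                 # consensus FASTA produced by poreCov
    local prefix="${consensus%.fasta}"   # assumed output naming scheme

    # map reference proteins to the consensus sequence
    "$PROOVFRAME" map -a "$proteins" -o "${prefix}.proovframe.tsv" "$consensus" || return 1

    # fix frameshifts; the original consensus file is left untouched
    "$PROOVFRAME" fix -o "${prefix}.fs-corrected.fasta" "$consensus" "${prefix}.proovframe.tsv"
}
```

Under these assumptions, `fs_correct GCF_009858895.2_ASM985889v3_protein_noORF1ab.faa sample.consensus.fasta` would write `sample.consensus.fs-corrected.fasta` alongside the unchanged `sample.consensus.fasta`.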

However, I would suggest providing these frameshift-corrected consensus sequences in addition to the default consensus sequences. Proper benchmarking is needed to check whether these corrections introduce other errors into SARS-CoV-2 sequences.
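As a very rough first sanity check ahead of any real benchmarking, one could compare the total sequence length before and after correction; an unexpectedly large difference would hint that the correction changed more than a few bases. The helper names `fasta_len` and `compare_len` are hypothetical, and this check neither validates ORFs nor replaces a proper benchmark.

```shell
# Sum the sequence lengths in a FASTA file (header lines are skipped).
fasta_len() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Print original length, corrected length, and their difference (tab-separated).
compare_len() {
    local a b
    a=$(fasta_len "$1")
    b=$(fasta_len "$2")
    printf '%s\t%s\t%s\n' "$a" "$b" "$((b - a))"
}
```

For example, `compare_len sample.consensus.fasta corrected.fasta` prints the two lengths and the number of bases gained or lost by the correction.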

hoelzer, Apr 06 '22 11:04