poreCov
Frameshift correction
Frameshifts (FSs) are quite frequently introduced in consensus sequences. In almost all cases these are errors.
Suggestion:
We could integrate a new tool, proovframe, to correct FSs by aligning reference protein sequences to the consensus sequences.
- https://github.com/thackl/proovframe
So far I have only tried this with a single example sequence, so it would need proper benchmarking:
Top: original sequence w/ FS from poreCov
Middle: sequence after proovframe correction w/ all SC2 proteins as reference. However, this introduces another error in ORF1a, likely due to the polyprotein structure of ORF1ab!
Bottom: I therefore removed the polyprotein sequence from the reference FASTA, and this seems to work: the sequence is fixed.
Reference protein FASTA used w/o the ORF1ab polyprotein: GCF_009858895.2_ASM985889v3_protein_noORF1ab.faa.zip
Commands:
# map reference proteins to the consensus sequence
proovframe/bin/proovframe map -a GCF_009858895.2_ASM985889v3_protein_noORF1ab.faa -o raw-seqs.tsv sample.consensus.fasta
# fix frameshifts in the consensus sequence
proovframe/bin/proovframe fix -o corrected.fasta sample.consensus.fasta raw-seqs.tsv
However, I would suggest providing these FS-corrected consensus sequences in addition to the default consensus sequences. Proper benchmarking is needed to confirm that the corrections do not introduce other errors in SARS-CoV-2 sequences.
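As a possible first step for such a benchmark, here is a minimal sketch that reports, per sequence, how much the length changed between the original and the corrected FASTA. The helper name length_diff is hypothetical (not part of proovframe or poreCov), and it assumes the corrected FASTA keeps the same sequence IDs as the original, which I have not verified.

```shell
#!/bin/sh
# Sketch only: compare per-sequence lengths before and after FS correction.
# length_diff is a hypothetical helper, not part of proovframe or poreCov.
# Assumption: both FASTA files use the same sequence IDs (first word of the header).

# Print "<id><TAB><original length><TAB><corrected length><TAB><difference>".
length_diff() {
    awk -v OFS='\t' '
        FNR == 1 { file++ }                      # count which input file we are in
        /^>/ {
            id = substr($1, 2)                   # header without ">" and description
            if (file == 1) order[++n] = id       # remember the IDs of the original file
            next
        }
        { gsub(/[ \t\r]/, ""); len[file, id] += length($0) }
        END {
            for (i = 1; i <= n; i++) {
                id = order[i]
                print id, len[1, id] + 0, len[2, id] + 0, len[2, id] - len[1, id]
            }
        }
    ' "$1" "$2"
}

# Run on the file names used in the commands above, if they exist.
if [ -f sample.consensus.fasta ] && [ -f corrected.fasta ]; then
    length_diff sample.consensus.fasta corrected.fasta
fi
```

A non-zero difference only shows that the correction changed the sequence length; whether the change is right would still have to be checked against the alignment.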