* proovframe: frame-shift correction for long-read (meta)genomics

Gene prediction on long reads, such as those from PacBio and Nanopore sequencing, is often impaired by indels that cause frameshifts. Proovframe detects and corrects frameshifts in coding sequences from raw long reads or long-read-derived assemblies.

#+ATTR_HTML: :width 600px
[[file:implementation.png]]

Proovframe uses frameshift-aware alignments to reference proteins as guides. It conservatively restores the reading frame by deleting 1 or 2 bases or by inserting 1 or 2 "N"s ("N"/"NN"), and it masks premature stop codons with "NNN".
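As a toy illustration of the insertion case (not proovframe's actual implementation), the sketch below takes a made-up 14-base read that has lost one base from its third codon and pads the gap with an "N". The sequence, the insert position, and the helper names insert_n and codons are hypothetical; in practice the position would come from the frameshift-aware alignment.

#+begin_src sh
# Toy example: restore the reading frame by padding a 1-base deletion with "N".
# Intended ORF: ATG GCT AAA GGT TAA; the read below lost one A from the AAA codon.
read_seq="ATGGCTAAGGTTAA"

# Insert "N" after the first $2 bases of sequence $1 (hypothetical helper).
insert_n() {
  awk -v s="$1" -v p="$2" 'BEGIN { print substr(s, 1, p) "N" substr(s, p + 1) }'
}

# Print a sequence as space-separated codons (hypothetical helper).
codons() {
  printf '%s\n' "$1" | fold -w 3 | paste -sd ' ' -
}

fixed_seq=$(insert_n "$read_seq" 8)

codons "$read_seq"    # ATG GCT AAG GTT AA  (frame lost after the deletion)
codons "$fixed_seq"   # ATG GCT AAN GGT TAA (frame restored, stop codon back in frame)
#+end_src

The padded codon "AAN" stays ambiguous, which is the conservative part: the frame is repaired without guessing the missing base.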

Good results can be obtained even with distantly related guide proteins: proovframe has been tested successfully with guide sets of <60% amino acid identity.

It can be used as an additional polishing step on top of classic consensus-polishing approaches for assemblies.

It can also be used directly on raw reads, and therefore on data that lack the sequencing depth needed for consensus polishing. This is a common problem, for example, for low-abundance members of environmental metagenomic samples.

** Usage

Requires [[https://github.com/bbuchfink/diamond][DIAMOND v2.0.3]] or newer for mapping.
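One way to check this requirement before mapping is sketched below. It assumes that "diamond version" prints a line whose last field is the version number (for example "diamond version 2.0.15") and that "sort -V" is available; version_ge is a hypothetical helper, not part of proovframe or DIAMOND.

#+begin_src sh
# Return success if version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v diamond >/dev/null 2>&1; then
  # Assumption: the version is the last field of the output of "diamond version".
  installed=$(diamond version 2>/dev/null | awk '{ print $NF; exit }')
  if version_ge "$installed" "2.0.3"; then
    echo "DIAMOND $installed is recent enough"
  else
    echo "DIAMOND $installed is older than 2.0.3, please upgrade" >&2
  fi
else
  echo "diamond not found on PATH" >&2
fi
#+end_src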

#+begin_src sh
# install
git clone https://github.com/thackl/proovframe

# map proteins to reads
proovframe/bin/proovframe map -a proteins.faa -o raw-seqs.tsv raw-seqs.fa

# fix frameshifts in reads
proovframe/bin/proovframe fix -o corrected-seqs.fa raw-seqs.fa raw-seqs.tsv
#+end_src
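To correct several read sets against the same guide proteins, the map and fix commands above can be wrapped in a loop. This is a minimal sketch: the function name correct_reads, the PROOVFRAME variable, and the output naming scheme are illustrative assumptions, and only the map and fix invocations shown above are used.

#+begin_src sh
# Path to the proovframe executable; override via the environment if needed.
PROOVFRAME="${PROOVFRAME:-proovframe/bin/proovframe}"

# correct_reads PROTEINS READS...
# For each reads file x.fa, writes x.tsv (alignments) and x.corrected.fa.
correct_reads() {
  proteins=$1
  shift
  for reads in "$@"; do
    base=${reads%.fa}
    # map proteins to reads, then fix frameshifts; stop at the first failure
    "$PROOVFRAME" map -a "$proteins" -o "$base.tsv" "$reads" || return 1
    "$PROOVFRAME" fix -o "$base.corrected.fa" "$reads" "$base.tsv" || return 1
  done
}

# Example (hypothetical file names):
# correct_reads proteins.faa sample1.fa sample2.fa
#+end_src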

** Citing

If you use proovframe and DIAMOND please cite:

  • Hackl T, Trigodet F, Murat Eren A, Biller SJ, Eppley JM, Luo E, et al. "proovframe: frameshift-correction for long-read (meta)genomics", bioRxiv. 2021. p. 2021.08.23.457338. doi:10.1101/2021.08.23.457338
  • Buchfink B, Reuter K, Drost HG, "Sensitive protein alignments at tree-of-life scale using DIAMOND", Nature Methods 18, 366–368 (2021). doi:10.1038/s41592-021-01101-x