cDNA_Cupcake

feature request: integrating short read evidence with Iso-Seq for isoform/fusion detection

Open Magdoll opened this issue 7 years ago • 0 comments

artifacts are often present in Iso-Seq data due to library preparation (template switching, PCR chimeras). they are difficult to remove because they arise at the cDNA level (not at the sequencing or bioinformatics level), and they introduce false positives in the discovery of both novel isoforms and fusion genes.

short reads and other sequencing platforms can serve as independent evidence to reduce these artifacts. a simple check is to align short read data (using the --all option to allow multi-mapping) to collapsed Iso-Seq data, then have a script remove / filter every Iso-Seq transcript that is not supported continuously, at each base, by short reads with at least a minimum coverage T (note: short reads often have trouble covering the ends, so the focus here is on the middle section).
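as a rough illustration of the check described above, here is a minimal sketch. the function name `is_continuously_supported` and the parameters `min_cov` (the threshold T) and `ignore_ends` are hypothetical, and the per-base coverage list is assumed to have been computed already from the short read alignments.

```python
from typing import Sequence


def is_continuously_supported(
    coverage: Sequence[int],
    min_cov: int = 1,
    ignore_ends: int = 0,
) -> bool:
    """Return True if every base in the middle section of a transcript
    has short-read coverage of at least min_cov.

    coverage    -- per-base short-read depth along one Iso-Seq transcript
    min_cov     -- minimum coverage T required at each base (assumed name)
    ignore_ends -- number of bases skipped at each end (assumed name)
    """
    middle = coverage[ignore_ends:len(coverage) - ignore_ends]
    if len(middle) == 0:
        # Nothing left to evaluate after trimming; treat as unsupported.
        return False
    return all(depth >= min_cov for depth in middle)


# Ends are uncovered but ignored; the middle is fully covered.
supported = is_continuously_supported([0, 5, 5, 5, 0], min_cov=3, ignore_ends=1)

# A dip below the threshold in the middle breaks continuous support.
unsupported = is_continuously_supported([0, 5, 2, 5, 0], min_cov=3, ignore_ends=1)
```

treating a transcript that is shorter than the trimmed ends as unsupported is one possible choice; reporting it separately may be preferable.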

this process should consist of several steps:

  • [ ] align short reads to collapsed, high-quality Iso-Seq data using a fast short read aligner (bowtie2, bwa). care needs to be taken to allow for more errors and for multi-mapping, since Iso-Seq data will contain many isoforms that may share the same exons and we want the reads to map to all of them.

  • [ ] create a coverage map per Iso-Seq transcript (open question: what is the best format to show this?)

  • [ ] user-defined filtering criteria (how much to ignore at the ends, minimum base coverage at what interval, etc)

  • [ ] output filtered/reduced set of Iso-Seq data and a summary report
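the steps above could be sketched roughly as follows. this is only an illustration under assumptions, not an existing cDNA_Cupcake script: the function names `build_coverage`, `filter_transcripts` and `summarize` are hypothetical, alignments are assumed to have been reduced to `(transcript_id, start, end)` tuples (0-based, half-open) from the aligner output, and the transcript IDs are made up.

```python
from typing import Dict, Iterable, List, Tuple

Alignment = Tuple[str, int, int]  # (transcript_id, start, end), 0-based half-open


def build_coverage(
    lengths: Dict[str, int], alignments: Iterable[Alignment]
) -> Dict[str, List[int]]:
    """Build a per-base coverage map for each transcript."""
    # Difference arrays: +1 where a read starts, -1 where it ends.
    diff = {tid: [0] * (n + 1) for tid, n in lengths.items()}
    for tid, start, end in alignments:
        if tid not in diff:
            continue  # read aligned to a transcript we are not tracking
        start, end = max(0, start), min(lengths[tid], end)
        if start >= end:
            continue
        diff[tid][start] += 1
        diff[tid][end] -= 1
    coverage = {}
    for tid, deltas in diff.items():
        running, per_base = 0, []
        for delta in deltas[:-1]:
            running += delta
            per_base.append(running)
        coverage[tid] = per_base
    return coverage


def filter_transcripts(
    coverage: Dict[str, List[int]], min_cov: int, ignore_ends: int
) -> Tuple[List[str], List[str]]:
    """Split transcript IDs into (kept, removed) using user-defined criteria."""
    kept, removed = [], []
    for tid, per_base in coverage.items():
        middle = per_base[ignore_ends:len(per_base) - ignore_ends]
        if middle and min(middle) >= min_cov:
            kept.append(tid)
        else:
            removed.append(tid)
    return kept, removed


def summarize(kept: List[str], removed: List[str]) -> str:
    """Return a one-line summary report."""
    total = len(kept) + len(removed)
    return f"{len(kept)} of {total} transcripts kept, {len(removed)} removed"


# Toy example with made-up transcript IDs and alignments.
lengths = {"PB.1.1": 10, "PB.1.2": 10}
alignments = [
    ("PB.1.1", 0, 6), ("PB.1.1", 4, 10),  # overlapping reads, no gap
    ("PB.1.2", 0, 4), ("PB.1.2", 6, 10),  # gap at bases 4-5
]
coverage = build_coverage(lengths, alignments)
kept, removed = filter_transcripts(coverage, min_cov=1, ignore_ends=2)
print(summarize(kept, removed))
```

a plain per-base list is used for the coverage map here only for simplicity; the question of the best format for it is still open.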

Magdoll, Mar 16 '17 16:03