scythe icon indicating copy to clipboard operation
scythe copied to clipboard

Interleave support

Open vsbuffalo opened this issue 12 years ago • 0 comments

This is an issue to open up discussion on interleave support on Scythe. I have some time to work on my quality control pipeline components and I'd like to use this precious time wisely. I should mention the nice thing about interleaved files is that for the most part, they allow PE reads to be handled as SE, so nothing really needs to be added to Scythe, except for support for orphans (see below).

Here are my thoughts about quality control in general. Many of beliefs were formed by seeing some pathological corner cases while working at the Bioinformatics Core (since we saw lots of sequencing data, from different preps, etc). I believe all of these should be handled well to make this process friendlier for an end-user.

  1. There should be a fast, preferably streaming tool to gather quality statistics. I wrote qrqc with this intention, because I like having the power of R at my disposal, but I am currently unhappy with having to open R to deal with streaming data via qrqc's readSeqFile.
  2. The quality pipeline should have statistics gathered in between every step. Again, this stems from my paranoia, having seen pathological data plus tools that didn't handle such data well, and having both combine into a giant mess. The best way to avoid this is by not trusting your data and tools, which in practice, means running every program's output through streaming a QC gathering tool.
  3. Nothing should require writing to file. Disks are too slow, and our QC approaches should be modular and built around Unix pipes. Every tool should be able to be used alone, embracing the Unix philosophy.
  4. However, with (3), handling pairs is tricky. Standard error should not be abused to handle one of the paired files. So this leaves one option: interleaving pairs. However, interleaving pairs brings up another problem. If a quality control step removes an entire read, how do we maintain pairing information? There is no standard way. Scythe currently outputs one N to maintain pairing information. I strongly dislike re-pairing programs, as these waste CPU cycles to repair something that would be unnecessary with some sort of loose standard or best practice. One option is to remove both pairs, but I see this is a lot of unnecessary data loss. Programs like sickle take these orphaned/unpaired single reads and output them to a separate file. But again, we're writing to disk here which is wasteful. But it's not guaranteed that programs will handle empty FASTQ lines, or lines with a single N well. Thus far, I like the idea of single low-quality N entries, and explicitly handling these at the end of the pipeline with a program that takes interleaved files and splits them into paired files and an orphaned read files.

Keeping this in mind, I'd like to add interleave support to Scythe. How orphaned reads are handled is up to debate. So far, I envision a pipeline like:

interleave -1 reads-1.fastq -2 reads-2.fastq | \ # interleave paired-end files
  qualstat -o raw-qual.Rdata | \ # gather pre-quality step statistics
  scythe -p 0.3 -a adapters.fasta - | \ # trim adapters
  qualstat -o after-scythe-qual.Rdata | \ # gather post-adapter-trimming stats
  seqtk trimfq -q 0.05 - | \ # trim FASTQ poor quality; sickle se could be used here too
  qualstat -o after-trimfq-qual.Rdata | \ # gather post-quality-trimming stats
  uninterleave -1 reads-trimmed-1.fastq -2 reads-trimmed-2.fastq -s reads-trimmed-singles.fastq # optional uninterleaving

qualstat is a C program that generates an R object (optionally a simple JSON file) that can be analyzed later.

Feedback/ideas would be appreciated!

Edit: Another option is to trim, but always leave a minimum number of bases (I believe bwa does/did this at one point — I seem to remember it only trimming and leaving 35bp minimum). I toyed with this idea, but I do not like this approach because this will affect mapping rate statistics. These short poor quality sequences may not map, and could inflate non-mapping rates.

vsbuffalo avatar May 30 '13 22:05 vsbuffalo