trimReads
Utility programs to trim or sort Illumina reads with adapter sequences
Illumina reads adapter screening utilities
This repository contains several utility programs that remove adapters and low-quality bases from Illumina reads.
:Author: Haibao Tang (`tanghaibao <http://github.com/tanghaibao>`_)
:Contributor: Tristan Lefebure
:Email: [email protected]
:License: `BSD <http://creativecommons.org/licenses/BSD/>`_
.. contents ::
Installation
------------
The program depends on the excellent `SeqAn library <http://www.seqan.de/>`_.
Please download the library and place ``seqan/`` in the same folder.
Please note that ``trimReads`` is no longer compatible with seqan-1.4+. For
backward compatibility, a copy of an older seqan is now included as ``seqan.tgz``.
To install, run::

    $ tar zxvf seqan.tgz
    $ make
trimReads
---------
Functionality emulates `cutadapt <http://code.google.com/p/cutadapt/>`_.
Adapter sequences are identified with the Waterman-Eggert algorithm as
implemented in `SeqAn <http://www.seqan.de/>`_. Quality trimming uses a
simple algorithm that takes the quality values, subtracts a user-specified cutoff,
and then finds the `max-sum segment <http://en.wikipedia.org/wiki/Maximum_subarray_problem>`_. This method
guarantees that the average base quality of the retained segment is higher than the user cutoff.
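The quality trimming idea described above can be sketched in a few lines of Python. This is only an illustration of the max-sum segment technique, not the C++ code used by ``trimReads``; the function name ``quality_trim`` and its return convention (a half-open ``(start, end)`` interval) are assumptions made for this example.

```python
def quality_trim(qualities, cutoff=20, offset=64):
    """Return (start, end) of the segment to keep, as a half-open interval.

    Each quality character is decoded with the given offset, the cutoff is
    subtracted, and the contiguous segment with the largest sum is selected
    (Kadane's algorithm). Returns (0, 0) if no segment has a positive sum.
    """
    best_sum, best = 0, (0, 0)
    cur_sum, cur_start = 0, 0
    for i, ch in enumerate(qualities):
        cur_sum += ord(ch) - offset - cutoff
        if cur_sum <= 0:
            # A non-positive running sum can never help a later segment.
            cur_sum, cur_start = 0, i + 1
        elif cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i + 1)
    return best


# Sanger-encoded example: '!' is Q0 and 'I' is Q40.
start, end = quality_trim("!!IIII!!", cutoff=20, offset=33)
print(start, end)
```

Because the selected segment has a positive sum after the cutoff is subtracted, its mean quality is necessarily above the cutoff, which is the guarantee stated above.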
There are other options to cut adapters, including `cutadapt <http://code.google.com/p/cutadapt/>`_
and `FASTX_TOOLKIT <http://hannonlab.cshl.edu/fastx_toolkit/>`_. The main advantages of this program:

- Accepts an adapter FASTA file
- Fast, robust and flexible
- Quality and adapter trimming in one step
- Can trim both the 5' and 3' ends
Just run::

    trimReads

to see a list of program options::

    Illumina reads trimming utility
    Author: Haibao Tang <[email protected]>
    Usage: trimReads [options] fastqfile
      -h, --help                   displays this help message
      -o, --outfile                Output file name. (default replace suffix with .trimmed.fastq)
      -f, --adapterfile            FASTA formatted file containing the adapters for removal (default adapters.fasta)
      -s, --score                  Minimum score to call adapter match. Default scoring scheme for +1 match, -3 for mismatch/gapOpen/gapExtension. (default 15)
      -q, --quality-cutoff         Trim low-quality regions below quality cutoff. The algorithm is similar to the one used by BWA by finding a max-sum segment within the quality string. Set it to 0 to skip quality trimming. (default 20)
      -m, --minimum-length         Discard trimmed reads that are shorter than LENGTH. (default 30)
      -Q, --quality-encoding       Read quality encoding for input file. 64 for Illumina, 33 for Sanger. (default 64)
      -d, --discard-adapter-reads  Discard reads with adapter sequences rather than trim (default 0)
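To get a feel for what the ``-s`` threshold means under the +1/-3 scoring scheme, here is a simplified Python sketch. It scores only ungapped alignments, whereas ``trimReads`` uses the gapped Waterman-Eggert local alignment from SeqAn, so treat it as an approximation; the function name ``best_ungapped_score`` is hypothetical.

```python
def best_ungapped_score(adapter, read, match=1, mismatch=-3):
    """Best local score over all ungapped placements of adapter on read.

    Simplification: gaps are not considered, so this is only a lower
    bound on what a gapped local aligner would report.
    """
    best = 0
    for shift in range(-len(adapter) + 1, len(read)):
        cur = 0
        for j, base in enumerate(adapter):
            i = shift + j
            if 0 <= i < len(read):
                step = match if base == read[i] else mismatch
                cur = max(0, cur + step)  # local alignment: never go below 0
                best = max(best, cur)
    return best


adapter = "AGATCGGAAGAGCGGTTCAG"
# The same 20 bases with one substitution at position 10, embedded in a read.
read = "TTTT" + "AGATCGGAAGTGCGGTTCAG" + "CC"
print(best_ungapped_score(adapter, read))
```

With these defaults, a perfect 20-base match scores 20, and a single mismatch in the middle costs the lost match plus a penalty of 3, giving 16, which still clears the default threshold of 15.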
Prepare a FASTA file of adapters to remove (more adapters will slow down the search); the default is ``adapters.fasta``. When ready, run::

    trimReads test.fastq

to get a trimmed file ``test.trimmed.fastq``. To turn off quality trimming, set ``-q`` to ``0``::

    trimReads -q 0 test.fastq
The detected adapter stretch will have quality values of ``AAAAAAAAAAA...``.
This helps you verify that the masked sequence is indeed adapter. For
example::

    @SNPSTER4:7:1:2:458#0/1 run=090205_SNPSTER4_0273_30GAUAAXX_PE
    ATTGAAGTGTTTGGGGTTCAAACACCGACAGATCGGAAGAGCGGTTCAGCAGGAAAGCCGAGACACACATCGGTATCCGCTTTTTTTTTT
    +
    aba`aaa]a`aaaaaa]a_aa\aa`aa_^AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBB
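To inspect the masked stretch programmatically, one option is to look for the longest run of ``A`` in the quality string and pull out the matching bases. The helper below is a hypothetical sketch, not part of ``trimReads``. Note that with the default offset of 64, ``A`` decodes to Q1, so a genuinely poor stretch could in principle look the same; the ``min_run`` parameter is an assumed safeguard against short coincidental runs.

```python
import re


def adapter_span(quality, mask_char="A", min_run=5):
    """Return (start, end) of the longest run of mask_char, or None.

    Runs shorter than min_run are ignored.
    """
    best = None
    for m in re.finditer(re.escape(mask_char) + "+", quality):
        length = m.end() - m.start()
        if length >= min_run and (best is None or length > best[1] - best[0]):
            best = (m.start(), m.end())
    return best


# Synthetic record: 8 good bases followed by a 13-base masked stretch.
seq = "ACGTACGT" + "AGATCGGAAGAGC"
qual = "a" * 8 + "A" * 13
span = adapter_span(qual)
if span is not None:
    print(seq[span[0]:span[1]])
```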
sortPairedReads
---------------
This program sorts all read pairs into three sets:
- Adapter set: pairs in which either /1 or /2 matches an adapter (in most cases both will match). These are fragments up to 1X read length.
- Overlap set: pairs in which /1 and /2 have a dovetail overlap. These are fragments up to 2X read length.
- Clean set: pairs that survive the above two searches.
The reason for this sorting is to get rid of the short fragments (set 1 and set 2) commonly found in Illumina PE libraries. Some libraries are worse than others. The goal is to keep, as input for downstream use, only the mated pairs that fall within the nominal insert size range.
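The dovetail overlap used for set 2 can be illustrated with a short Python sketch. When a fragment is shorter than twice the read length, the 3' end of /1 reads into the same bases as the reverse complement of /2. The code below only checks exact suffix/prefix matches, whereas ``sortPairedReads`` scores an alignment against ``--endMatchScore``; the function names and the ``min_overlap`` parameter are assumptions made for this example.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def dovetail_overlap(read1, read2, min_overlap=20):
    """Length of the longest exact overlap between the suffix of read1
    and the prefix of the reverse complement of read2, or 0 if it is
    shorter than min_overlap."""
    rc2 = revcomp(read2)
    for k in range(min(len(read1), len(rc2)), min_overlap - 1, -1):
        if read1[-k:] == rc2[:k]:
            return k
    return 0


# A 30 bp fragment sequenced with 20 bp reads from both ends.
fragment = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
read1 = fragment[:20]
read2 = revcomp(fragment)[:20]
print(dovetail_overlap(read1, read2, min_overlap=5))
```

In this toy case the expected overlap is 2 x 20 - 30 = 10 bases, which is how the overlap length relates to fragment size.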
Just run::

    sortPairedReads

to see a list of program options::

    Sort pairs of Illumina reads
    Author: Haibao Tang <[email protected]>
    Usage: sortPairedReads [options] fastqfile1 fastqfile2
      -h, --help               displays this help message
      -O, --nooverlap          Turn off overlapping reads detection, and do not create .overlap.fastq file. (default 0)
      -f, --adapterfile        FASTA formatted file containing the adapters for removal (default adapters.fasta)
      -s, --adapterMatchScore  Minimum score to call adapter match. Default scoring scheme for +1 match, -3 for mismatch/gapOpen/gapExtension. (default 15)
      -t, --endMatchScore      Minimum score to call dovetail match. Default scoring scheme for +1 match, -3 for mismatch/gapOpen/gapExtension. (default 20)
      -Q, --quality-encoding   Read quality encoding for input file. 64 for Illumina, 33 for Sanger. (default 64)
      -v, --verbose            Print alignments for debugging (default 0)
For any two given fastq files, the output consists of 4 files: ``fastqfile1.adapters.fastq`` (set 1),
``fastqfile1.overlap.fastq`` (set 2), and ``fastqfile1.clean.fastq`` and
``fastqfile2.clean.fastq`` (set 3). As genome assembler input, I recommend
discarding set 1, treating set 2 as unmated, and treating set 3 as mated.
For example::
    $ sortPairedReads s1.fastq s2.fastq
    [0] Illumina_PE-1 found 0 times
    [1] Illumina_PE-2 found 0 times
    [2] Illumina_PE-1rc found 54 times
    [3] Illumina_PE-2rc found 83 times
    Processed 2500 sequences took 3.33262 seconds.
    $ ls *.*.fastq
    s1.clean.fastq s2.clean.fastq s1.adapters.fastq s1.overlap.fastq
Turn ``-O`` on if you do not want the ``.overlap.fastq`` file::

    $ sortPairedReads s1.fastq s2.fastq -O
    ...
    $ ls *.*.fastq
    s1.clean.fastq s2.clean.fastq s1.adapters.fastq