CARE
CARE copied to clipboard
CARE: Context-Aware Read Error correction for Illumina reads
CARE 2.1: Context-Aware Read Error correction for Illumina reads.
Prerequisites
- C++17
- OpenMP
- Zlib
- GNU Make
- Python 3.8 (optional. See section 'Forests for CARE 2.0')
Additional prerequisites for GPU version
- CUDA Toolkit 11 or newer
- A CUDA-capable graphics card with Pascal architecture (e.g. Nvidia GTX 1080) or newer.
Download
Clone the repository and initialize submodules: git clone --recurse-submodules https://github.com/fkallen/CARE.git
Build
The build process assumes that the required compilers are available in your PATH.
Make
Run make to generate the executables.
CPU version: This produces an executable file care-cpu in the top-level directory of CARE
make / make cpu
GPU version: This produces an executable file care-gpu in the top-level directory of CARE
make gpu
Optionally, after executables have been built they can be copied to an installation directory via make install
.
This will copy available executables to the directory PREFIX/bin. The default value for PREFIX is /usr/local
.
A custom prefix can be set as follows:
make install PREFIX=/my/custom/prefix
Run
The simplest command which only includes mandatory options is
./care-cpu -i reads.fastq -d outputdir -o correctedreads.fastq -c 30 --pairmode PE
This command will attempt to correct the reads from file reads.fastq, assuming a read coverage of 30. The parameter --pairmode PE
is used to execute the paired-end correction path.
The outputfile named correctedreads.fastq will be placed in the directory outputdir. The available program parameters are listed below.
Input files must be in fasta or fastq format, and may be gzip'ed. Specifying both fasta files and fastq files together is not allowed.
If the input files are unpaired, the setting --pairmode SE
must be used, which selects the single-end correction path.
If the input files are paired instead, either --pairmode SE
or --pairmode PE
may be used. Paired-end information is only used when --pairmode PE
is set.
Output files will be uncompressed by default. The order of reads will be preserved. Read headers and quality scores (if fastq) remain unchanged. Compressed output can be enabled with --gzoutput
. (This can be really slow).
A more advanced usage could look like the following command. It enables progress counter -p
and uses quality scores -q
which are stored in a lossy compressed 2-bit format --qualityScoreBits 2
. The program should use 16 threads -t 16
with a memory limit of 22 gigabyte -m 22G
. Sequences which contain other letters than A,C,G,T, e.g. N, will be skipped --excludeAmbiguous
. -k
and -h
specify the parameters of the hashing, namely the k-mer size and the number of hash tables. With --candidateCorrection
, additional sequence corrections may be computed per read which are then used to either accept or reject the primary correction. This can improve correction quality (reduces FP, but also TP) at the expense of greater memory usage to store the additional corrections.
./care-cpu -i reads.fastq -d . -o correctedreads.fastq -c 30 --pairmode PE -p -q --qualityScoreBits 2 --excludeAmbiguous -m 22G -t 16 -k 20 -h 32 --candidateCorrection
The equivalent execution of the GPU version using two GPUs would be:
./care-gpu -i reads.fastq -d . -o correctedreads.fastq -c 30 --pairmode PE -p -q --qualityScoreBits 2 --excludeAmbiguous -m 22G -t 16 -k 20 -h 32 --candidateCorrection -g 0,1
Note the additional mandatory parameter -g
which accepts a comma-separated list of integers to indicate which GPUs can be used. The integers must be between 0 and N-1, where N is the number of available GPUs in the system.
Specifying input files
Single-end library
For a single-end library consisting of one or more files, repeat argument -i
for each file
Paired-end library
A paired-end library must be either a single file in interleaved format, or two files in split format.
Interleaved
Two consecutive reads form a read pair. Use -i reads_interleaved
.
Split
Read number N in file 1 and read number N in file 2 form a read pair. Use -i reads_1 -i reads_2
.
Available program parameters
General
Please execute ./care-cpu --help
or ./care-gpu --help
to print a list of available parameters. Both versions share a common subset of parameters.
The following list is a selection of usefull options.
-h, --hashmaps arg The requested number of hash maps. Must be
greater than 0. The actual number of used hash
maps may be lower to respect the set memory
limit. Default: 48
-t, --threads arg Maximum number of thread to use. Default: 1
-q, --useQualityScores If set, quality scores (if any) are
considered during read correction. Default: false
--candidateCorrection If set, candidate reads will be corrected,too. Default: false
-p, --showProgress If set, progress bar is shown during correction
-m, --memTotal arg Total memory limit in bytes. Can use suffix
K,M,G , e.g. 20G means 20 gigabyte. This option
is not a hard limit. Default: All free
memory.
If an option allows multiple values to be specified, the option can be repeated with different values. As an alternative, multiple values can be separated by comma (,). Both ways can be used simultaneously. For example, to specify three single-end input files the following options are equivalent:
-i file1,file2,file3
-i file1 -i file2 -i file3
-i file1 -i file2,file3
Confidence of corrections
When the option --correctionQualityLabels
is specified, CARE will append additional information to the read headers in the output file.
This information indicates the confidence of correction for each read.
care:q=1 --- Low confidence
care:q=2 --- Medium confidence
care:q=3 --- High confidence
care:q=0 --- All other reads
Note: A value of q > 0 does not imply that the read has been altered.
Forests for CARE 2.0
CARE 2.0 introduces a Random-Forest-based error correction mode. To use this mode, trained random forest classifiers need to be supplied using the program parameters
--correctionType 1 --ml-forestfile anchorforest.rf
and --correctionTypeCands 1 --ml-cands-forestfile candsforest.rf
for anchor correction and candidate correction, respectively.
The same forest files can be used for both the CPU version and the GPU version.
A small collection of pre-trained forests is available here
For more information about training the random forests please see the descriptions in ml/readme.md
Algorithm
This work is presented in the following papers.
Felix Kallenborn, Andreas Hildebrandt, Bertil Schmidt, CARE: Context-Aware Sequencing Read Error Correction, Bioinformatics, , btaa738, https://doi.org/10.1093/bioinformatics/btaa738
Kallenborn, F., Cascitti, J. & Schmidt, B. CARE 2.0: reducing false-positive sequencing error corrections using machine learning. BMC Bioinformatics 23, 227 (2022). https://doi.org/10.1186/s12859-022-04754-3