fastp
Request: multiple files in and single file out
There may be data from multiple lanes. I tried:
fastp -i a.L1.R1.fq.gz a.L2.R1.fq.gz -I a.L1.R2.fq.gz a.L2.R2.fq.gz -o a.out.R1.fq.gz a.out.R2.fq.gz
but it seems only L1 data is processed.
You can run fastp multiple times
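A rough sketch of what running fastp once per lane could look like. The file naming (SAMPLE.L<lane>.R1.fq.gz / SAMPLE.L<lane>.R2.fq.gz) is taken from the command in the original post; the function name run_per_lane and the names of the intermediate and report files are made up for illustration. Note that each lane then gets its own JSON/HTML report, so the statistics are per lane rather than per sample.

```shell
# run_per_lane SAMPLE
# Hypothetical helper: runs fastp on each lane separately, then appends the
# per-lane results to SAMPLE.out.R1.fq.gz and SAMPLE.out.R2.fq.gz.
run_per_lane() {
    sample=$1
    # Start with empty output files so a re-run does not append to old data.
    : > "$sample.out.R1.fq.gz"
    : > "$sample.out.R2.fq.gz"
    for r1 in "$sample".L*.R1.fq.gz; do
        [ -e "$r1" ] || { echo "no lane files for $sample" >&2; return 1; }
        lane=${r1%.R1.fq.gz}    # e.g. a.L1
        fastp -i "$r1" -I "$lane.R2.fq.gz" \
              -o "$lane.R1.trimmed.fq.gz" -O "$lane.R2.trimmed.fq.gz" \
              -j "$lane.fastp.json" -h "$lane.fastp.html" || return 1
        # Concatenated gzip files form a valid multi-member gzip stream.
        cat "$lane.R1.trimmed.fq.gz" >> "$sample.out.R1.fq.gz"
        cat "$lane.R2.trimmed.fq.gz" >> "$sample.out.R2.fq.gz"
    done
}
```

With the files from the original post this would be called as run_per_lane a, which leaves the combined reads in a.out.R1.fq.gz and a.out.R2.fq.gz.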
@lozybean I would suggest using bash process substitution to achieve this. You can still use a single command line, for example: fastp --in1 <(zcat sample*_R1_001.fastq.gz) --in2 <(zcat sample*_R2_001.fastq.gz).
@Jiawei-Navican Process substitution does not work with fastp. If you check the output FASTQ files you will see that the read names for R1 and R2 do not match.
$ fastp -i R1.fq -I R2.fq -o R1.out.fq -O R2.out.fq
Read1 before filtering:
total reads: 100000
total bases: 4854072
Q20 bases: 4788994(98.6593%)
Q30 bases: 4661071(96.0239%)
Read2 before filtering:
total reads: 100000
total bases: 4761161
Q20 bases: 4693657(98.5822%)
Q30 bases: 4564765(95.875%)
Read1 after filtering:
total reads: 100000
total bases: 4854019
Q20 bases: 4788947(98.6594%)
Q30 bases: 4661025(96.024%)
Read2 aftering filtering:
total reads: 100000
total bases: 4761088
Q20 bases: 4693591(98.5823%)
Q30 bases: 4564703(95.8752%)
Filtering result:
reads passed filter: 200000
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0
Duplication rate: 5.24207%
Insert size peak (evaluated by paired-end reads): 54
JSON report: fastp.json
HTML report: fastp.html
fastq -i R1.fq -I R2.fq -o R1.out.fq -O R2.out.fq
fastp v0.20.0, time used: 1 seconds
$ fastp -i <(cat R1.fq) -I <(cat R2.fq) -o R1.process_substitution.out.fq -O R2.process_substitution.out.fq
Read1 before filtering:
total reads: 87915
total bases: 4267649
Q20 bases: 4211066(98.6741%)
Q30 bases: 4099762(96.0661%)
Read2 before filtering:
total reads: 87915
total bases: 4185671
Q20 bases: 4126665(98.5903%)
Q30 bases: 4013551(95.8879%)
Read1 after filtering:
total reads: 87915
total bases: 4267551
Q20 bases: 4210974(98.6743%)
Q30 bases: 4099674(96.0662%)
Read2 aftering filtering:
total reads: 87915
total bases: 4185566
Q20 bases: 4126569(98.5905%)
Q30 bases: 4013460(95.8881%)
Filtering result:
reads passed filter: 175830
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 6
bases trimmed due to adapters: 87
Duplication rate: 0%
Insert size peak (evaluated by paired-end reads): 32
JSON report: fastp.json
HTML report: fastp.html
fastq -i /dev/fd/63 -I /dev/fd/62 -o R1.process_substitution.out.fq -O R2.process_substitution.out.fq
fastp v0.20.0, time used: 0 seconds
$ for R1 in R1*.out.fq ; do echo $R1; awk 'NR % 4 == 1 && NR < 40' ${R1}; echo; done
R1.out.fq
@A00305:250:HTY5KDRXX:1:2101:4652:1031 1:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:12789:1031 1:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:20907:1031 1:N:0:NGGAATAT+CGGTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:8910:1047 1:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:11496:1047 1:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:13286:1047 1:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:14570:1047 1:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:27642:1047 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:28944:1047 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:3604:1063 1:N:0:CGGAATAT+CGGTGCTGGTGTAGAT
R1.process_substitution.out.fq
@A00305:250:HTY5KDRXX:1:2101:7934:34334 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:10267:34334 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:12201:34334 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:26955:34334 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:16984:34350 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:18069:34350 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:32009:34350 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:9543:34366 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:10664:34366 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:14714:34366 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
$ for R2 in R2*.out.fq ; do echo $R2; awk 'NR % 4 == 1 && NR < 40' ${R2}; echo; done
R2.out.fq
@A00305:250:HTY5KDRXX:1:2101:4652:1031 2:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:12789:1031 2:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:20907:1031 2:N:0:NGGAATAT+CGGTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:8910:1047 2:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:11496:1047 2:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:13286:1047 2:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:14570:1047 2:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:27642:1047 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:28944:1047 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:3604:1063 2:N:0:CGGAATAT+CGGTGCTGGTGTAGAT
R2.process_substitution.out.fq
@A00305:250:HTY5KDRXX:1:2101:11885:18192 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:14633:18192 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:14886:18192 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:22064:18192 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:24596:18192 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:31250:18192 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:5999:18208 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:14859:18208 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:18204:18208 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
As you can see from the fastp runs with the actual files and with <(cat $fastq), fastp reports fewer reads in the second version.
As you can see, R1.out.fq and R2.out.fq contain the same read IDs. R1.process_substitution.out.fq and R1.out.fq are different, and so are R1.process_substitution.out.fq and R2.process_substitution.out.fq.
So not only is fastp dropping reads, it is not even dropping the same reads from the R1 and R2 FASTQ files.
I think fastp is trying to seek in the input FASTQ files. Since the inputs are pipes, it cannot seek back, and it loses data as a result.
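If the pipe explanation is right, one workaround is to hand fastp regular files instead: concatenate the lanes to disk first, keeping R1 and R2 in the same lane order. A minimal sketch, assuming the SAMPLE.L<lane>.R1.fq.gz / SAMPLE.L<lane>.R2.fq.gz naming from the original post; merge_lanes and the output prefix are hypothetical names. It also assumes fastp reads every member of a concatenated gzip file, which is worth confirming by comparing the read counts it reports against the inputs.

```shell
# merge_lanes SAMPLE OUT_PREFIX
# Hypothetical helper: concatenates all lanes of a sample into
# OUT_PREFIX.R1.fq.gz and OUT_PREFIX.R2.fq.gz, in the same lane order.
merge_lanes() {
    prefix=$1
    out=$2
    # Start with empty output files so a re-run does not append to old data.
    : > "$out.R1.fq.gz"
    : > "$out.R2.fq.gz"
    for r1 in "$prefix".L*.R1.fq.gz; do
        [ -e "$r1" ] || { echo "no lane files for $prefix" >&2; return 1; }
        # Derive the mate file from the R1 name so both sides stay in step.
        r2=${r1%.R1.fq.gz}.R2.fq.gz
        [ -e "$r2" ] || { echo "missing mate file: $r2" >&2; return 1; }
        cat "$r1" >> "$out.R1.fq.gz"
        cat "$r2" >> "$out.R2.fq.gz"
    done
}
```

After merge_lanes a a.merged, fastp can be run on regular files: fastp -i a.merged.R1.fq.gz -I a.merged.R2.fq.gz -o a.out.R1.fq.gz -O a.out.R2.fq.gz. The cost is the extra disk space for the merged copies.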
@sfchen Could support for multiple FASTQ files as input to both -i and -I be added, or at least could process substitution be supported for the input files? It also looks like fastp does not check whether the read names for R1 and R2 match. Such a check would at least have warned me that something weird was going on.
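Until something like that exists in fastp, the pairing can be checked from the outside. A small sketch of such a check, not anything fastp provides: check_pairs is a made-up name, it expects uncompressed FASTQ with 4 lines per record, and it treats the read ID as the header text up to the first whitespace (with an optional old-style /1 or /2 suffix removed).

```shell
# check_pairs R1.fq R2.fq
# Walks both files in step and reports the first record whose read IDs differ.
# Exits non-zero on a mismatch or when the files have different lengths.
check_pairs() {
    awk -v r2="$2" '
        # Strip everything after the first whitespace and any /1 or /2 suffix.
        function read_id(h) {
            sub(/[ \t].*$/, "", h)
            sub(/\/[12]$/, "", h)
            return h
        }
        {
            # Pull the corresponding line from R2 for every line of R1.
            if ((getline mate < r2) <= 0) {
                print "R2 has fewer lines than R1"; bad = 1; exit
            }
            # Header lines are lines 1, 5, 9, ... of the 4-line records.
            if (FNR % 4 == 1 && read_id($0) != read_id(mate)) {
                printf "mismatch at record %d: %s vs %s\n", (FNR + 3) / 4, read_id($0), read_id(mate)
                bad = 1; exit
            }
        }
        END {
            if (bad) exit 1
            if ((getline mate < r2) > 0) { print "R2 has more lines than R1"; exit 1 }
            print "read names match"
        }
    ' "$1"
}
```

Given the read IDs listed above, check_pairs R1.process_substitution.out.fq R2.process_substitution.out.fq should already report a mismatch at record 1, while check_pairs R1.out.fq R2.out.fq should at least get past the first ten records.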
@KimBioInfoStudio and @y9c and @sfchen any update on this issue is appreciated.