fastp Request: multiple file in and single file out

Open lozybean opened this issue 5 years ago • 4 comments

There may be data from multiple lanes, so I tried:

fastp -i a.L1.R1.fq.gz a.L2.R1.fq.gz -I a.L1.R2.fq.gz a.L2.R2.fq.gz -o a.out.R1.fq.gz a.out.R2.fq.gz

but it seems only L1 data is processed.

Aug 12 '19 06:08 lozybean

You can run fastp multiple times

Aug 12 '19 06:08 sfchen

@lozybean I would suggest using bash process substitution to achieve this. You can still use a single command line, for example:

fastp --in1 <(zcat sample*_R1_001.fastq.gz) --in2 <(zcat sample*_R2_001.fastq.gz)

Jan 07 '20 17:01 Jiawei-Navican

@Jiawei-Navican Process substitution does not work with fastp. If you check the output FASTQ files you will see that the read names for R1 and R2 do not match.

$ fastp -i R1.fq -I R2.fq -o R1.out.fq -O R2.out.fq

Read1 before filtering:
total reads: 100000
total bases: 4854072
Q20 bases: 4788994(98.6593%)
Q30 bases: 4661071(96.0239%)

Read2 before filtering:
total reads: 100000
total bases: 4761161
Q20 bases: 4693657(98.5822%)
Q30 bases: 4564765(95.875%)

Read1 after filtering:
total reads: 100000
total bases: 4854019
Q20 bases: 4788947(98.6594%)
Q30 bases: 4661025(96.024%)

Read2 aftering filtering:
total reads: 100000
total bases: 4761088
Q20 bases: 4693591(98.5823%)
Q30 bases: 4564703(95.8752%)

Filtering result:
reads passed filter: 200000
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

Duplication rate: 5.24207%

Insert size peak (evaluated by paired-end reads): 54

JSON report: fastp.json
HTML report: fastp.html

fastq -i R1.fq -I R2.fq -o R1.out.fq -O R2.out.fq
fastp v0.20.0, time used: 1 seconds





$ fastp -i <(cat R1.fq) -I <(cat R2.fq) -o R1.process_substitution.out.fq -O R2.process_substitution.out.fq

Read1 before filtering:
total reads: 87915
total bases: 4267649
Q20 bases: 4211066(98.6741%)
Q30 bases: 4099762(96.0661%)

Read2 before filtering:
total reads: 87915
total bases: 4185671
Q20 bases: 4126665(98.5903%)
Q30 bases: 4013551(95.8879%)

Read1 after filtering:
total reads: 87915
total bases: 4267551
Q20 bases: 4210974(98.6743%)
Q30 bases: 4099674(96.0662%)

Read2 aftering filtering:
total reads: 87915
total bases: 4185566
Q20 bases: 4126569(98.5905%)
Q30 bases: 4013460(95.8881%)

Filtering result:
reads passed filter: 175830
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 6
bases trimmed due to adapters: 87

Duplication rate: 0%

Insert size peak (evaluated by paired-end reads): 32

JSON report: fastp.json
HTML report: fastp.html

fastq -i /dev/fd/63 -I /dev/fd/62 -o R1.process_substitution.out.fq -O R2.process_substitution.out.fq
fastp v0.20.0, time used: 0 seconds



$ for R1 in R1*.out.fq ; do echo $R1; awk 'NR % 4 == 1 && NR < 40' ${R1}; echo; done

R1.out.fq
@A00305:250:HTY5KDRXX:1:2101:4652:1031 1:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:12789:1031 1:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:20907:1031 1:N:0:NGGAATAT+CGGTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:8910:1047 1:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:11496:1047 1:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:13286:1047 1:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:14570:1047 1:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:27642:1047 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:28944:1047 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:3604:1063 1:N:0:CGGAATAT+CGGTGCTGGTGTAGAT

R1.process_substitution.out.fq
@A00305:250:HTY5KDRXX:1:2101:7934:34334 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:10267:34334 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:12201:34334 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:26955:34334 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:16984:34350 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:18069:34350 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:32009:34350 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:9543:34366 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:10664:34366 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:14714:34366 1:N:0:CGGAATAT+CGTTGCTGGTGTAGAT



$ for R2 in R2*.out.fq ; do echo $R2; awk 'NR % 4 == 1 && NR < 40' ${R2}; echo; done

R2.out.fq
@A00305:250:HTY5KDRXX:1:2101:4652:1031 2:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:12789:1031 2:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:20907:1031 2:N:0:NGGAATAT+CGGTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:8910:1047 2:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:11496:1047 2:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:13286:1047 2:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:14570:1047 2:N:0:NGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:27642:1047 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:28944:1047 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:3604:1063 2:N:0:CGGAATAT+CGGTGCTGGTGTAGAT

R2.process_substitution.out.fq
@A00305:250:HTY5KDRXX:1:2101:11885:18192 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:14633:18192 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:14886:18192 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:22064:18192 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:24596:18192 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:31250:18192 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:5999:18208 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:14859:18208 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT
@A00305:250:HTY5KDRXX:1:2101:18204:18208 2:N:0:CGGAATAT+CGTTGCTGGTGTAGAT

As you can see from the fastp runs on the actual files and on <(cat $fastq), fastp reports fewer reads in the second run (87915 versus 100000 per file).

As you can see, R1.out.fq and R2.out.fq contain the same read IDs. R1.process_substitution.out.fq differs from R1.out.fq, and it also differs from R2.process_substitution.out.fq. So fastp is not only eating reads, it is not even eating the same reads in the R1 and R2 FASTQ files.
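The comparison above was done by eye on the first ten headers. A minimal sketch of the same check over whole files follows. It assumes 4-line FASTQ records and that the read ID is the first whitespace-delimited field of each header line, as in the headers shown above. `same_read_ids` is a hypothetical helper name, not something fastp provides.

```bash
# Sketch: succeed only if two FASTQ files list the same read IDs in the same order.
# Assumptions: 4 lines per record; the ID is the first whitespace-delimited
# field of the header (older "/1" and "/2" suffixes would need stripping first).
# same_read_ids is a hypothetical helper, not part of fastp.
same_read_ids() {
    local r1=$1 r2=$2 ids1 ids2 status
    ids1=$(mktemp) || return 2
    ids2=$(mktemp) || { rm -f "$ids1"; return 2; }
    # Header lines are lines 1, 5, 9, ... of each file.
    awk 'NR % 4 == 1 { print $1 }' "$r1" > "$ids1"
    awk 'NR % 4 == 1 { print $1 }' "$r2" > "$ids2"
    # cmp -s is silent and returns non-zero on the first difference.
    cmp -s "$ids1" "$ids2"
    status=$?
    rm -f "$ids1" "$ids2"
    return "$status"
}
```

For example, `same_read_ids R1.out.fq R2.out.fq && echo paired || echo MISMATCH` could be run on each pair of output files.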

I think fastp is trying to seek in the input FASTQ files. Because the inputs are pipes, it cannot seek back, and so it loses data.
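If pipes are indeed the problem, one possible workaround is to merge the lanes into regular files first and hand those to fastp. The sketch below is only an illustration under that assumption. `merge_lanes` is a hypothetical helper, the file names are taken from the original post, and the lanes must be listed in the same order for R1 and R2 so the pairs stay aligned.

```bash
# Sketch: decompress several gzipped lane files into one regular FASTQ file.
# merge_lanes is a hypothetical helper, not part of fastp.
# Usage: merge_lanes OUTPUT.fq LANE1.fq.gz LANE2.fq.gz ...
merge_lanes() {
    local out=$1
    shift
    # gzip -dc writes the decompressed inputs to stdout in argument order.
    gzip -dc "$@" > "$out"
}

# Example (commented out; needs the lane files and fastp on PATH):
# merge_lanes a.R1.fq a.L1.R1.fq.gz a.L2.R1.fq.gz
# merge_lanes a.R2.fq a.L1.R2.fq.gz a.L2.R2.fq.gz
# fastp -i a.R1.fq -I a.R2.fq -o a.out.R1.fq.gz -O a.out.R2.fq.gz
```

This costs temporary disk space for the uncompressed files, but the inputs are then ordinary seekable files, as in the first run above.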

@sfchen Could support for multiple FASTQ files as input to both -i and -I be added, or at least support for process substitution on the input files? It also looks like fastp does not check whether the read names for R1 and R2 match. Such a check would at least have warned me that something weird was going on.

Oct 28 '20 16:10 ghuls

@KimBioInfoStudio and @y9c and @sfchen any update on this issue is appreciated.

Apr 14 '23 17:04 farshadf
