fastp
Filter only one read of paired-end experiment but remove corresponding reads in other file [feature request]
This tool is really fantastic and I would love to integrate it into my workflows. However, there is one thing that would be very useful. In our single-cell experiments we have paired-end reads, the first containing only barcodes and the second the transcripts. Would it be possible to apply filtering only to the second reads and then remove the corresponding first reads? Otherwise the reads will be out of sync and cause problems downstream.
Thanks!
Could you please paste some reads of your data here?
No problem, here you go:
R1 (26 bp):
@A00302:38:HG35HDMXX:1:1101:2410:1000 1:N:0:CCACTACA
NTGTAGCTCGATCCCTAGCTTTAGCC
+
#FFFFF:FFFFF:FFFFFFFFFFFFF
@A00302:38:HG35HDMXX:1:1101:2826:1000 1:N:0:CCACTACA
NCTTTCTGTGAGTGACTCTTCTCTTA
+
#FFFFFFFFFFFFFFFFFFFFFFFFF
@A00302:38:HG35HDMXX:1:1101:3531:1000 1:N:0:CCAATACA
NCGTACTAGACCACGAGGGTTATCCT
+
#,,FFFFF:F::F,::FFF,FF:F:,
@A00302:38:HG35HDMXX:1:1101:4833:1000 1:N:0:CCACTACA
NGTGAAGTCTTAGCCCCTGTTTCAGC
+
#FFFFFFFFF,FFFF:FFF:FFFFFF
R2 (91 bp):
@A00302:38:HG35HDMXX:1:1101:2410:1000 2:N:0:CCACTACA
GGGGGAATAAAAAAGTTAAAAAAATAAAAAAAAAAATCTCCCCCAAAAAAACCAAAAAAAAAAACAAAGAAAAAAAGCAAAAAAAATCTTT
+
,F,,,,:,,,::::,:F,:F,F,,,,,:FFFFFF:,:F,,,:,F::,,,,::,:,:FFFFFFF,,,F,:,,,FF,,,,,,F,F,:,,,,,F
@A00302:38:HG35HDMXX:1:1101:2826:1000 2:N:0:CCACTACA
CTCTGTCCTTAAGAAGAGATTGTTACCAAGACTCCAGGCTAGGAGAGATTGCAGTTATCCACCAATCATACAGTGTGCTATGCTTCTGTGC
+
FFFFF::FFFFFF:FF:FF,FFFFFFFF:FFFFFFF,FFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFFFF::FFFF
@A00302:38:HG35HDMXX:1:1101:3531:1000 2:N:0:CCAATACA
GTGGAAGATGAGAAACTTCAAGGCAAGATCAATGATTAGGACAAACAGAAGATTCTTGACAAGTGCAATGAAATCATCAGCTGGCTGGAAA
+
FF:FFF,F,FF:FFFFF::F,FF:FF,FFFFFF,:F,F,FF:FF,:FFFFFFFFFFFFFFFFFFFFFFF,FFF:FFFF:,FFF:,:,FF,F
@A00302:38:HG35HDMXX:1:1101:4833:1000 2:N:0:CCACTACA
GATACCTTGGCTGTGGCCACGGACACAAAGGCCACCCGGGCCGTCCACACTGGTCTTGCTGTGGGAAGTTCATTGAGAAGTCCGAGTGCTC
+
FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFF:FFFF:FFFFFFFFF:F,FFFFFFFFF,FF,F:FFF,FFFFFFFFFF
I hope 4 reads are enough. The read name up until the space is identical between R1 and R2; the part after that differs only in the first character (`1` for R1, `2` for R2).
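The pairing rule described above can be sketched in Python as follows. This is only an illustration based on the headers pasted in this thread; `pair_key` is a hypothetical helper name, not part of fastp.

```python
def pair_key(header: str) -> str:
    """Return the part of a FASTQ header that both mates of a pair share."""
    if not header.startswith("@"):
        raise ValueError(f"not a FASTQ header: {header!r}")
    # Everything before the first whitespace is the read name. The comment
    # after it ("1:N:0:..." for R1, "2:N:0:..." for R2) differs between mates.
    fields = header[1:].split(maxsplit=1)
    if not fields:
        raise ValueError("empty read name")
    return fields[0]


# Headers of the first pair shown above.
r1_header = "@A00302:38:HG35HDMXX:1:1101:2410:1000 1:N:0:CCACTACA"
r2_header = "@A00302:38:HG35HDMXX:1:1101:2410:1000 2:N:0:CCACTACA"
assert pair_key(r1_header) == pair_key(r2_header)
```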
Emm, this is not a common feature request and requires some customization
Sorry, I think my explanation was a bit poor. Let me try again, if you still think it's too specific I understand.
In single cell RNA-seq experiments it is very common to have paired-end sequencing. The first read contains only barcodes; in the case of 10x Genomics kits (which are very common as well) it is 1 barcode for the cell (16 bp) and 1 barcode for each transcript (10 bp), resulting in a total of 26 bp. It would be silly to apply filtering and trimming to this read because these sequences are artificial and always have to be the same length. The second read, in contrast, comes from the actual transcript, so it makes sense to apply filtering and trimming to it for quality control.
At our institute, we use an Illumina NovaSeq sequencer, which I would say is a fairly common machine. I receive two FASTQ files, one containing all read 1 and the other containing all read 2, and of course the corresponding pairs are in the same order in both files.
As described above, if I run fastp on both FASTQ files I run into problems because barcodes are filtered or trimmed. If I instead submit only the FASTQ file of read 2 (from the transcript), the output FASTQ will have fewer reads than the read 1 file (because read 1 wasn't filtered). This will cause problems later on because the paired FASTQ files are expected to be in sync.
When I say 'in sync', I mean that the read pairs are in the same order in their respective FASTQ files, and that only complete pairs are present. That means, if a read was removed from the R2 FASTQ file, the corresponding read in the R1 FASTQ file must be removed as well. Otherwise, you'll end up with a different number of reads in the two FASTQ files.
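To make 'in sync' concrete, here is a minimal Python sketch of a check for it. It assumes plain four-line FASTQ records (no wrapped sequences) and that the read name is the header up to the first whitespace; `open_fastq`, `read_names` and `in_sync` are hypothetical helper names.

```python
import gzip
from itertools import zip_longest


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def read_names(path):
    """Yield the read name of every record (assumes four lines per record)."""
    with open_fastq(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0 and line.strip():  # header line of each record
                yield line[1:].split(maxsplit=1)[0]


def in_sync(r1_path, r2_path):
    """Return True if both files hold the same read names in the same order."""
    for name1, name2 in zip_longest(read_names(r1_path), read_names(r2_path)):
        if name1 != name2:  # also catches unequal read counts (None vs. a name)
            return False
    return True
```

Calling `in_sync` on the R1 and R2 files returns `False` as soon as a name differs or one file runs out of reads first.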
So I was wondering, since your tool is already able to process paired-end FASTQ files and keep them in sync, if it was possible to use both FASTQ files as input for fastp, but then apply the filtering only to the reads of one of the two files (in this case read 2) and just keep the other one in sync. Technically, I assume this shouldn't be a big challenge. I would like to look at this myself but I'm not very familiar with C++ and don't have spare time at the moment.
After running fastp only on read 2 (in single-end mode), I tried to 'sync' the files manually myself but wasn't able to make it work.
Anyway, sorry for the wall of text, but I hope this better explains the situation. As I said, I think it shouldn't be too challenging, and in my opinion a lot of people could benefit from this (also because single cell RNA-seq experiments are becoming more and more standard and almost always require paired-end sequencing, often with one of the reads containing just barcodes).
Thanks for your time 👍
Good, thanks for your clear explanation. I will make it happen in a future release.
@sfchen Thanks a lot 🙏
I too would like to see this for single cell data.
I am not sure if this is implemented. There are some flags for UMI processing, but I am not sure if they apply.
Hi there,
I've had the same use case spring up, where I only wanted to trim and filter based on a single biological read and not the associated barcode, UMI, etc. reads. For anyone interested, I found a pretty easy workaround using fastq-pair.
Assume we have `r1.fq` and `r2.fq`. We want to filter both `r1` and `r2` based on reads in `r1`:
- Run `fastp` on `r1.fq` to filter as you desire. Output to `fastp_r1.fastq`.
- Match passing reads in `fastp_r1.fastq` with reads in `r2.fq` using `fastq-pair`.
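If fastq-pair is not at hand, the matching step could be approximated with a short Python sketch like the one below. This is not how fastq-pair or fastp work internally. It assumes four-line FASTQ records, that the read name is the header up to the first whitespace, and that all surviving read names fit in memory; the function names are hypothetical.

```python
import gzip


def open_fastq(path, mode="rt"):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, mode) if str(path).endswith(".gz") else open(path, mode)


def records(path):
    """Yield (read_name, four_lines) for every record in a FASTQ file."""
    with open_fastq(path) as handle:
        while True:
            lines = [handle.readline() for _ in range(4)]
            if not lines[0].strip():  # end of file
                return
            yield lines[0][1:].split(maxsplit=1)[0], lines


def keep_surviving_mates(filtered_path, other_path, out_path):
    """Copy the records of other_path whose names still occur in filtered_path.

    Returns the number of records written.
    """
    kept = {name for name, _ in records(filtered_path)}
    written = 0
    with open_fastq(out_path, "wt") as out:
        for name, lines in records(other_path):
            if name in kept:
                out.writelines(lines)  # record is copied unchanged
                written += 1
    return written
```

For the example above, `keep_surviving_mates("fastp_r1.fastq", "r2.fq", "synced_r2.fq")` would write only the `r2` reads whose mates passed filtering (`synced_r2.fq` is a made-up output name). The output follows the order of the unfiltered file, so the two results are only in sync if the filtered file kept the original read order.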
Hope this helps!
Yes that's also how I ended up fixing it! Thanks for actually putting the answer here :smile:
Hope this gets integrated into fastp.